Executive Summary of AI and Emerging Technologies
Prepared by:
Dr. Mejdal Alqahtani
Table of contents
1. Overview of AI
2. Basic math and probabilities
3. Machine learning
3.1. Supervised learning
3.2. Unsupervised learning
4. Deep learning
5. Reinforcement learning
6. Programming languages
6.1. Python language
6.2. R language
6.3. SQL language
7. Business Intelligence
7.1. Tableau
7.2. Power BI
7.3. Data Visualization
8. Emerging Technologies
8.1. Artificial Intelligence (AI)
8.2. Internet of Things (IoT)
8.3. Robotic Process Automation (RPA)
8.4. Cloud Computing
8.5. Quantum Computing
8.6. 3D Printing
8.7. 5th Generation Network (5G)
8.8. Extended Reality (XR)
8.9. Blockchain
8.10. Cyber Security
1. Overview of AI
2. Basic math and probabilities
CS 229 – Machine Learning (https://stanford.edu/~shervine)
VIP Refresher: Linear Algebra and Calculus
Afshine Amidi and Shervine Amidi, October 6, 2018

General notations

• Vector – We note x ∈ R^n a vector with n entries, where x_i ∈ R is the i-th entry:
  x = (x_1, x_2, ..., x_n)^T ∈ R^n

• Matrix – We note A ∈ R^(m×n) a matrix with m rows and n columns, where A_(i,j) ∈ R is the entry located in the i-th row and j-th column.
Remark: the vector x defined above can be viewed as an n × 1 matrix and is more particularly called a column vector.

• Identity matrix – The identity matrix I ∈ R^(n×n) is a square matrix with ones on its diagonal and zeros everywhere else.
Remark: for all matrices A ∈ R^(n×n), we have A × I = I × A = A.

• Diagonal matrix – A diagonal matrix D ∈ R^(n×n) is a square matrix with values d_1, ..., d_n on its diagonal and zeros everywhere else. We also note D as diag(d_1, ..., d_n).

Matrix operations

• Vector-vector multiplication – There are two types of vector-vector products:
  - inner product: for x, y ∈ R^n, x^T y = Σ_{i=1..n} x_i y_i ∈ R
  - outer product: for x ∈ R^m, y ∈ R^n, xy^T ∈ R^(m×n) is the matrix whose (i, j) entry is x_i y_j.

• Matrix-vector multiplication – The product of a matrix A ∈ R^(m×n) and a vector x ∈ R^n is a vector of R^m, such that:
  Ax = (a_{r,1}^T x, ..., a_{r,m}^T x)^T = Σ_{i=1..n} a_{c,i} x_i ∈ R^m
where a_{r,i}^T are the rows of A, a_{c,j} its columns, and x_i the entries of x.

• Matrix-matrix multiplication – The product of matrices A ∈ R^(m×n) and B ∈ R^(n×p) is a matrix of size m × p, such that the (i, j) entry of AB is a_{r,i}^T b_{c,j}, i.e.:
  AB = Σ_{i=1..n} a_{c,i} b_{r,i}^T ∈ R^(m×p)
where a_{r,i}^T, b_{r,i}^T are the rows and a_{c,j}, b_{c,j} the columns of A and B respectively.

• Transpose – The transpose of a matrix A ∈ R^(m×n), noted A^T, is such that its entries are flipped: for all i, j, (A^T)_{i,j} = A_{j,i}.
Remark: for matrices A, B, we have (AB)^T = B^T A^T.

• Inverse – The inverse of an invertible square matrix A is noted A^(-1) and is the only matrix such that A A^(-1) = A^(-1) A = I.
Remark: not all square matrices are invertible. Also, for invertible matrices A, B, we have (AB)^(-1) = B^(-1) A^(-1).

• Trace – The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries: tr(A) = Σ_{i=1..n} A_{i,i}.

• Determinant – The determinant of a square matrix A, noted det(A) or |A|, can be expanded along a row i as:
  det(A) = |A| = Σ_{j=1..n} (−1)^(i+j) A_{i,j} |A_{\i,\j}|
where A_{\i,\j} is A without its i-th row and j-th column.
Remark: A is invertible if and only if |A| ≠ 0. Also, |AB| = |A||B| and |A^T| = |A|.

Matrix properties

• Symmetric decomposition – A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:
  A = (A + A^T)/2 + (A − A^T)/2
where the first term is symmetric and the second antisymmetric.

• Norm – A norm is a function N : V → [0, +∞[ where V is a vector space, such that for all x, y ∈ V we have N(x + y) ≤ N(x) + N(y), among other axioms. Commonly used norms include the p-norm ||x||_p = (Σ_i x_i^p)^(1/p), associated with the Hölder inequality, and the infinity norm ||x||_∞ = max_i |x_i|, associated with uniform convergence.

• Positive semi-definite matrix – A matrix A ∈ R^(n×n) is positive semi-definite (PSD) if A = A^T and x^T A x ≥ 0 for all x ∈ R^n.
Remark: similarly, a matrix A is said to be positive definite, noted A ≻ 0, if it is a PSD matrix that satisfies x^T A x > 0 for every non-zero vector x.

• Eigenvalue, eigenvector – Given a matrix A ∈ R^(n×n), λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called an eigenvector, such that Az = λz.

• Spectral theorem – Let A ∈ R^(n×n). If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^(n×n): noting Λ = diag(λ_1, ..., λ_n), there exists a diagonal Λ such that A = U Λ U^T.

• Singular-value decomposition – For a given matrix A of dimensions m × n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of a unitary matrix U (m × m), a diagonal matrix Σ (m × n) and a unitary matrix V (n × n) such that A = U Σ V^T.

• Gradient operations – For matrices A, B, C, the following gradient properties are worth having in mind:
  ∇_A tr(AB) = B^T    and    ∇_{A^T} f(A) = (∇_A f(A))^T
Remark: the Hessian of f is only defined when f is a function that returns a scalar.
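As a quick numerical sanity check of the spectral theorem and the SVD stated above, here is a minimal NumPy sketch; the matrices used are arbitrary illustrative examples, not part of the original refresher.

import numpy as np

# A small symmetric matrix (illustrative example)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Spectral theorem: a symmetric A is diagonalizable by an orthogonal U
eigvals, U = np.linalg.eigh(A)           # eigenvalues and orthonormal eigenvectors
Lambda = np.diag(eigvals)
print(np.allclose(A, U @ Lambda @ U.T))  # True: A = U Λ U^T

# Singular-value decomposition: A = U Σ V^T for any (even non-square) matrix
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
Um, s, Vt = np.linalg.svd(B)
Sigma = np.zeros(B.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(B, Um @ Sigma @ Vt))   # True: B = U Σ V^T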
VIP Refresher: Probabilities and Statistics
Afshine Amidi and Shervine Amidi, August 6, 2018

Introduction to Probability and Combinatorics

• Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.

• Event – Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.

• Axioms of probability – For each event E, we denote P(E) the probability of event E occurring. Noting E_1, ..., E_n mutually exclusive events, we have the 3 following axioms:
  (1) 0 ≤ P(E) ≤ 1    (2) P(S) = 1    (3) P(E_1 ∪ ... ∪ E_n) = Σ_{i=1..n} P(E_i)

• Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n, r) = n! / (n − r)!.

• Combination – A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n, r) = P(n, r)/r! = n! / (r!(n − r)!).
Remark: for 0 ≤ r ≤ n, we have P(n, r) ≥ C(n, r).

Conditional Probability

• Bayes' rule – For events A and B such that P(B) > 0, we have:
  P(A|B) = P(B|A) P(A) / P(B)
Remark: we have P(A ∩ B) = P(A) P(B|A) = P(A|B) P(B).

• Partition – Let {A_i, i ∈ [[1,n]]} be such that for all i, A_i ≠ ∅. We say that {A_i} is a partition if we have:
  ∀ i ≠ j, A_i ∩ A_j = ∅   and   A_1 ∪ ... ∪ A_n = S
Remark: for any event B in the sample space, we have P(B) = Σ_{i=1..n} P(B|A_i) P(A_i).

• Extended form of Bayes' rule – Let {A_i, i ∈ [[1,n]]} be a partition of the sample space. We have:
  P(A_k|B) = P(B|A_k) P(A_k) / Σ_{i=1..n} P(B|A_i) P(A_i)

• Independence – Two events A and B are independent if and only if P(A ∩ B) = P(A) P(B).

Random Variables

• Random variable – A random variable, often noted X, is a function that maps every element of a sample space to the real line.

• Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and such that lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, is defined as F(x) = P(X ≤ x).
Remark: we have P(a < X ≤ b) = F(b) − F(a).

• Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

• Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and continuous (C) cases:
  (D): CDF F(x) = Σ_{x_i ≤ x} P(X = x_i);  PDF f(x_j) = P(X = x_j);  properties: 0 ≤ f(x_j) ≤ 1 and Σ_j f(x_j) = 1
  (C): CDF F(x) = ∫_{−∞}^{x} f(y) dy;  PDF f(x) = dF/dx;  properties: f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1

• Expectation and moments of the distribution – Expressions of the expected value E[X], generalized expected value E[g(X)], k-th moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:
  (D): E[X] = Σ_i x_i f(x_i);  E[g(X)] = Σ_i g(x_i) f(x_i);  E[X^k] = Σ_i x_i^k f(x_i);  ψ(ω) = Σ_i f(x_i) e^(iωx_i)
  (C): E[X] = ∫ x f(x) dx;  E[g(X)] = ∫ g(x) f(x) dx;  E[X^k] = ∫ x^k f(x) dx;  ψ(ω) = ∫ f(x) e^(iωx) dx  (integrals over R)

• Variance – The variance of a random variable, often noted Var(X) or σ², is a measure of the spread of its distribution function. It is determined as follows:
  Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

• Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable: σ = √Var(X).

• Transformation of random variables – Let the variables X and Y be linked by some function. Noting f_X and f_Y the distribution functions of X and Y respectively, we have:
  f_Y(y) = f_X(x) |dx/dy|

• Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:
  ∂/∂c ∫_a^b g(x) dx = (∂b/∂c)·g(b) − (∂a/∂c)·g(a) + ∫_a^b (∂g/∂c)(x) dx

• Chebyshev's inequality – Let X be a random variable with expected value µ and standard deviation σ. For k, σ > 0, we have the following inequality:
  P(|X − µ| ≥ kσ) ≤ 1/k²

Jointly Distributed Random Variables

• Conditional density – The conditional density of X with respect to Y, often noted f_{X|Y}, is defined as f_{X|Y}(x) = f_{XY}(x, y) / f_Y(y).

• Independence – Two random variables X and Y are said to be independent if we have f_{XY}(x, y) = f_X(x) f_Y(y).

• Marginal density and cumulative distribution – From the joint density function f_{XY}, we have:
  (D): marginal f_X(x_i) = Σ_j f_{XY}(x_i, y_j);  cumulative F_{XY}(x, y) = Σ_{x_i ≤ x} Σ_{y_j ≤ y} f_{XY}(x_i, y_j)
  (C): marginal f_X(x) = ∫_{−∞}^{+∞} f_{XY}(x, y) dy;  cumulative F_{XY}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{XY}(x', y') dx' dy'

• Covariance – The covariance of two random variables X and Y, noted σ²_{XY} or Cov(X, Y), is defined as:
  Cov(X, Y) = σ²_{XY} = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y

• Correlation – Noting σ_X, σ_Y the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρ_{XY}, as follows:
  ρ_{XY} = σ²_{XY} / (σ_X σ_Y)
Remarks: for any X, Y, we have ρ_{XY} ∈ [−1, 1]. If X and Y are independent, then ρ_{XY} = 0.

• Main distributions – Here are the main distributions to have in mind (PDF, characteristic function ψ(ω), mean and variance), with q = 1 − p:
  (D) Binomial X ∼ B(n, p), x ∈ [[0, n]]:  P(X = x) = C(n, x) p^x q^(n−x);  ψ(ω) = (p e^(iω) + q)^n;  E[X] = np;  Var(X) = npq
  (D) Poisson X ∼ Po(µ), x ∈ N:  P(X = x) = µ^x e^(−µ) / x!;  ψ(ω) = e^(µ(e^(iω) − 1));  E[X] = µ;  Var(X) = µ
  (C) Uniform X ∼ U(a, b), x ∈ [a, b]:  f(x) = 1/(b − a);  ψ(ω) = (e^(iωb) − e^(iωa)) / ((b − a)iω);  E[X] = (a + b)/2;  Var(X) = (b − a)²/12
  (C) Gaussian X ∼ N(µ, σ), x ∈ R:  f(x) = (1/(√(2π) σ)) e^(−(1/2)((x − µ)/σ)²);  ψ(ω) = e^(iωµ − ω²σ²/2);  E[X] = µ;  Var(X) = σ²
  (C) Exponential X ∼ Exp(λ), x ∈ R+:  f(x) = λ e^(−λx);  ψ(ω) = 1/(1 − iω/λ);  E[X] = 1/λ;  Var(X) = 1/λ²

Parameter estimation

• Random sample – A random sample is a collection of n random variables X_1, ..., X_n that are independent and identically distributed with X.

• Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

• Bias – The bias of an estimator θ̂ is defined as the difference between the expected value of the distribution of θ̂ and the true value, i.e. Bias(θ̂) = E[θ̂] − θ.

• Central Limit Theorem – Given a random sample X_1, ..., X_n following a given distribution with mean µ and variance σ², we have, as n → +∞:
  X̄ ∼ N(µ, σ/√n)
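The Central Limit Theorem stated above can be checked numerically; the following is a minimal NumPy sketch, where the exponential source distribution, the sample size and the number of trials are arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 10_000          # sample size and number of repeated samples
lam = 2.0                        # Exp(λ): mean 1/λ, variance 1/λ²

# Draw many samples and compute their means
sample_means = rng.exponential(scale=1/lam, size=(trials, n)).mean(axis=1)

# CLT prediction: the sample mean is approximately N(µ, σ/√n)
mu, sigma = 1/lam, 1/lam
print("empirical mean of X̄:", sample_means.mean())   # close to µ = 0.5
print("empirical std  of X̄:", sample_means.std())    # close to σ/√n ≈ 0.022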
3. Machine learning
3.1. Supervised learning

VIP Cheatsheet: Supervised Learning, September 9, 2018

• Type of prediction – The two main types of supervised problems are summed up below:
  - Regression: the outcome is continuous; example: linear regression.
  - Classifier: the outcome is a class; examples: logistic regression, SVM, Naive Bayes.

• Type of model – The different models are summed up below:
  - Discriminative models estimate P(y|x) directly; examples: regressions, SVMs.
  - Generative models estimate P(x|y) and deduce P(y|x); examples: GDA, Naive Bayes.

Notations and general concepts

• Hypothesis – The hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).

• Loss function – A loss function is a function L : (z, y) ∈ R × Y → L(z, y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are.

• Gradient descent – Noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:
  θ ← θ − α ∇J(θ)
Remark: stochastic gradient descent (SGD) updates the parameters based on each training example, while batch gradient descent updates them on a batch of training examples.

• Likelihood – The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. In practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize. We have:
  θ_opt = arg max_θ L(θ)

• Newton's algorithm – Newton's algorithm is a numerical method that finds θ such that ℓ'(θ) = 0. Its update rule is as follows:
  θ ← θ − ℓ'(θ)/ℓ''(θ)
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the update rule θ ← θ − (∇²_θ ℓ(θ))^(−1) ∇_θ ℓ(θ).

Linear regression

We assume here that y|x; θ ∼ N(µ, σ²).

• Normal equations – Noting X the design matrix, the value of θ that minimizes the cost function has a closed-form solution, namely θ = (X^T X)^(−1) X^T y.

• LMS algorithm – Noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, also known as the Widrow-Hoff learning rule, is as follows:
  ∀j, θ_j ← θ_j + α Σ_{i=1..m} (y(i) − hθ(x(i))) x_j(i)
Remark: this update rule is a particular case of gradient ascent.

• LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), defined with parameter τ ∈ R as:
  w(i)(x) = exp(−(x(i) − x)² / (2τ²))

Classification and logistic regression

• Sigmoid function – The sigmoid function g, also known as the logistic function, is defined as follows:
  ∀z ∈ R, g(z) = 1 / (1 + e^(−z)) ∈ ]0,1[

• Logistic regression – We assume here that y|x; θ ∼ Bernoulli(φ), with φ = g(θ^T x).

• Softmax regression – Softmax regression generalizes logistic regression to K classes; the probability of class i is:
  φ_i = exp(θ_i^T x) / Σ_{j=1..K} exp(θ_j^T x)

Generalized Linear Models

• Exponential family – A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter (also called the canonical parameter or link function) η, a sufficient statistic T(y) and a log-partition function a(η) as follows:
  p(y; η) = b(y) exp(η T(y) − a(η))
Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter that makes sure the probabilities sum to one.

Here are the most common exponential-family distributions (distribution: η, T(y), a(η), b(y)):
  - Bernoulli: η = log(φ/(1 − φ)), T(y) = y, a(η) = log(1 + exp(η)), b(y) = 1
  - Gaussian: η = µ, T(y) = y, a(η) = η²/2, b(y) = (1/√(2π)) exp(−y²/2)
  - Poisson: η = log(λ), T(y) = y, a(η) = e^η, b(y) = 1/y!
  - Geometric: η = log(1 − φ), T(y) = y, a(η) = log(e^η/(1 − e^η)), b(y) = 1

• Assumptions of GLMs – Generalized Linear Models (GLMs) aim at predicting a random variable y as a function of x ∈ R^(n+1) and rely on the following 3 assumptions:
  (1) y|x; θ ∼ ExpFamily(η)   (2) hθ(x) = E[y|x; θ]   (3) η = θ^T x
Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Support Vector Machines

• Optimal margin classifier – The optimal margin classifier is defined by (w, b) ∈ R^n × R, the solution of the following optimization problem:
  min (1/2)||w||²  such that  y(i)(w^T x(i) − b) ≥ 1
Remark: the decision boundary is defined as the line w^T x − b = 0.

• Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:
  L(z, y) = [1 − yz]_+ = max(0, 1 − yz)

• Kernel – Given a feature mapping φ, we define the kernel K as K(x, z) = φ(x)^T φ(z). In practice, the kernel defined by K(x, z) = exp(−||x − z||²/(2σ²)) is called the Gaussian kernel and is commonly used.
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping φ, which is often very complicated. Instead, only the values K(x, z) are needed.

• Lagrangian – We define the Lagrangian L(w, b) as follows:
  L(w, b) = f(w) + Σ_{i=1..l} β_i h_i(w)
Remark: the coefficients β_i are called the Lagrange multipliers.

Generative Learning

A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.

Gaussian Discriminant Analysis

• Setting – GDA assumes that y ∼ Bernoulli(φ) and that x conditioned on y is Gaussian.

• Estimation – Maximizing the likelihood gives the following estimates:
  φ̂ = (1/m) Σ_{i=1..m} 1{y(i) = 1}
  µ̂_j = Σ_{i=1..m} 1{y(i) = j} x(i) / Σ_{i=1..m} 1{y(i) = j}   (j = 0, 1)
  Σ̂ = (1/m) Σ_{i=1..m} (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})^T

Naive Bayes

• Assumption – The Naive Bayes model supposes that the features of each data point are all independent:
  P(x|y) = P(x1, x2, ...|y) = P(x1|y) P(x2|y) ... = Π_{i=1..n} P(xi|y)

• Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0,1} and l ∈ [[1,L]]:
  P(y = k) = (1/m) × #{j | y(j) = k}
  P(xi = l | y = k) = #{j | y(j) = k and xi(j) = l} / #{j | y(j) = k}
Remark: Naive Bayes is widely used for text classification and spam detection.

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

• CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

• Random forest – It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to a simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.
Remark: random forests are a type of ensemble method.

• Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one.

Other non-parametric approaches

• k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.
Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.

Learning Theory

The classical analysis assumes that:
  • the training and testing sets follow the same distribution
  • the training examples are drawn independently

• Shattering – Given a set S = {x(1), ..., x(d)} and a set of classifiers H, we say that H shatters S if for any set of labels {y(1), ..., y(d)} there exists a classifier in H that assigns exactly those labels to S.

• Upper bound theorem – Let H be a finite hypothesis class such that |H| = k, and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:
  ε(ĥ) ≤ (min_{h∈H} ε(h)) + 2 √((1/(2m)) log(2k/δ))

• Hoeffding inequality – Let Z1, ..., Zm be m iid variables drawn from a Bernoulli distribution of parameter φ. Let φ̂ be their sample mean and γ > 0 fixed. We have:
  P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)
VIP Cheatsheet: Machine Learning Tips and Tricks

Metrics

Given a set of data points {x(1), ..., x(m)}, where each x(i) has n features, associated to a set of outcomes {y(1), ..., y(m)}, we want to assess a given classifier that learns how to predict y from x.

Classification

In the context of binary classification, here are the main metrics that are important to track to assess the performance of the model.

• Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows (actual class vs. predicted class):
  - Actual +, Predicted +: True Positives (TP)
  - Actual +, Predicted –: False Negatives (FN), Type II error
  - Actual –, Predicted +: False Positives (FP), Type I error
  - Actual –, Predicted –: True Negatives (TN)

• Main metrics – Several metrics are commonly used to assess the performance of classification models; for example, the False Positive Rate (FPR), also called 1 − specificity, is FP / (TN + FP).

• ROC – The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the decision threshold.

• AUC – The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve.

Regression

• Basic metrics – Given a regression model f, the following metrics are commonly used to assess the performance of the model:
  - Total sum of squares: SS_tot = Σ_{i=1..m} (y_i − ȳ)²
  - Explained sum of squares: SS_reg = Σ_{i=1..m} (f(x_i) − ȳ)²
  - Residual sum of squares: SS_res = Σ_{i=1..m} (y_i − f(x_i))²

Model selection

• Model selection – Train the model on the training set, evaluate it on the development set, pick the model with the best performance on the development set, and then retrain that model on the whole training set.

• Cross-validation – Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. The different types are summed up below:
  - k-fold: training on k − 1 folds and assessment on the remaining one; generally k = 5 or 10.
  - Leave-p-out: training on n − p observations and assessment on the p remaining ones; the case p = 1 is called leave-one-out.
The most commonly used method is k-fold cross-validation, which splits the training data into k folds, validates the model on one fold while training on the k − 1 other folds, and repeats this k times. The error is then averaged over the k folds and is named the cross-validation error.

Diagnostics

• Bias – The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.

• Variance – The variance of a model is the variability of the model prediction for given data points.

• Bias/variance tradeoff – The simpler the model, the higher the bias; the more complex the model, the higher the variance. Typical symptoms:
  - Underfitting: high training error, training error close to test error, high bias.
  - Just right: training error slightly lower than test error.
  - Overfitting: low training error, training error much lower than test error, high variance.
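The k-fold cross-validation procedure described above can be sketched in a few lines of NumPy; the least-squares model, fold count and toy data below are illustrative assumptions.

import numpy as np

def k_fold_cv_error(X, y, k=5, seed=0):
    """Average validation MSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit θ on the k-1 training folds (least squares)
        theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        pred = X[val_idx] @ theta
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return np.mean(errors)  # cross-validation error, averaged over the k folds

# Toy regression data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 2.0 + 3.0 * X[:, 1] + 0.5 * rng.normal(size=100)
print("5-fold CV error:", k_fold_cv_error(X, y, k=5))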
Regularization

• Regularization – The regularization procedure aims at preventing the model from overfitting the data and thus deals with high-variance issues. Commonly used techniques include LASSO (L1), Ridge (L2) and Elastic Net regularization.

• Error analysis – Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

• Ablative analysis – Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

Summary of common regression models (what does each fit?):
  - Linear: a line in n dimensions
  - Polynomial: a polynomial of order k
  - Bayesian Linear: a Gaussian distribution for each point
  - Ridge: a linear/polynomial fit with L2 regularization
  - LASSO: a linear/polynomial fit with L1 regularization
  - Logistic: a linear/polynomial fit passed through a sigmoid

• L1 Regularization: prevents the weights from getting too large (as measured by the L1 norm). The larger the weights, the more complex the model is and the greater the chance of overfitting. L1 regularization introduces sparsity in the weights: it forces more weights to be exactly zero rather than merely reducing the average magnitude of all weights.

• Entropy regularization: used for models that output probabilities; it pushes the predicted probability distribution towards the uniform distribution.
Cheat Sheet – Bayes' Theorem and Conditional Probability

• Conditional probability P(A|B) describes how the probability of an event changes when we have knowledge of another event; it is usually a better estimate than the unconditional P(A).

• Bayes' Theorem: P(A|B) = P(B|A) · P(A) / P(B), where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior and P(B) the evidence.

Example:
  • Probability of fire P(F) = 1%
  • Probability of smoke P(S) = 10%
  • Probability of smoke given there is a fire P(S|F) = 90%
  • What is the probability that there is a fire given we see smoke, P(F|S)?

Naïve Bayes assumes the features (x1, x2, ...) are conditionally independent, which lets the likelihood be written as a product of per-feature probabilities.

Source: https://www.cheatsheets.aqeel-anwar.com
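Working the fire-and-smoke example through Bayes' theorem is a one-line computation:

p_fire, p_smoke, p_smoke_given_fire = 0.01, 0.10, 0.90

# Bayes' theorem: P(F|S) = P(S|F) * P(F) / P(S)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)  # 0.09 -> a 9% chance of fire given that we see smoke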
Cheat Sheet – Imbalanced Data in Classification

Common metrics (with TP, FP, TN, FN as in the confusion matrix):
  Precision = TP / (TP + FP)
  Recall, Sensitivity (true positive rate) = TP / (TP + FN)
  Specificity = TN / (TN + FP)
  F1 score = 2 × (Prec × Rec) / (Prec + Rec)
  Accuracy = (TP + TN) / (TP + FN + FP + TN)

Possible solutions to class imbalance (e.g. many "blue" label-1 samples, few "green" label-0 samples):
  1. Data Replication: replicate the available minority-class data until the numbers of samples are comparable.
  2. Synthetic Data: for images, rotate, dilate, crop or add noise to existing input images to create new data.
  3. Modified Loss: modify the loss to reflect a greater error when misclassifying the smaller sample set, e.g. loss = a · loss_green + b · loss_blue with a > b.
  4. Change the algorithm: increase the model/algorithm complexity so that the two classes become perfectly separable (con: overfitting). For example, no straight line y = ax passing through the origin can perfectly separate the data (the best such solution, y = 0, predicts all labels blue), whereas a line y = ax + b can, so the green class is no longer predicted as blue.
1. Bagging: Trains N different weak models (usually of the same type – homogeneous), each on a subset created from the original dataset, in parallel. In the test phase, each model predicts and the predictions are combined by voting.
2. Boosting: Trains N different weak models (usually of the same type – homogeneous) with the complete dataset in sequential order. The data points wrongly classified by the previous weak model are given more weight so that they can be classified properly by the next weak learner. In the test phase, each model is evaluated and, based on the test error of each weak model, its prediction is weighted for voting. Boosting methods decrease the bias of the prediction.
3. Stacking: Trains N different weak models (usually of different types – heterogeneous) with one of two subsets of the dataset, in parallel. Once the weak learners are trained, they are used to train a meta-learner that combines their predictions and carries out the final prediction using the other subset. In the test phase, each model predicts its label, and this set of labels is fed to the meta-learner, which generates the final prediction.
The step-by-step procedures for each of these ensemble methods are summarized below.
Ensemble Method – Boosting (steps)
  Step 1: Assign equal weights to all the data points in the dataset (uniform weights).
  Step 2a: Train a weak model with equal weights on all the data points.
  Step 2b: Based on the final error of the trained weak model, calculate a scalar alpha. Use alpha to increase the weights of wrongly classified points and decrease the weights of correctly classified points.
  Steps 3a, 3b, ..., (n+1)a: Train the next weak model with the adjusted weights on all the data points, compute its alpha, and adjust the weights again; repeat for each weak model.
  Step n+2: In the test phase, predict with each weak model and vote their predictions, weighted by the corresponding alpha, to get the final prediction.

Ensemble Method – Bagging (steps)
  Step 1: Create N subsets from the original dataset, one for each weak model.
  Step 2: Train each weak model with an independent subset, in parallel.
  Step 3: In the test phase, predict with each weak model and vote their predictions to get the final prediction.

Ensemble Method – Stacking (steps)
  Step 1: Split the dataset into Subset #1 (for the weak learners) and Subset #2 (for the meta-learner).
  Step 2: Train each weak model with the weak-learner dataset.
  Step 3: Train a meta-learner whose inputs are the outputs of the trained weak models on the meta-learner dataset; the meta-learner generates the final prediction.
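As an illustration of the bagging procedure listed above, here is a minimal NumPy sketch that bootstraps subsets, trains a very simple weak learner on each, and combines them by majority vote; the one-feature decision stump and the toy data are hypothetical choices made only for this example.

import numpy as np

class DecisionStump:
    """A weak learner: thresholds a single feature (illustrative, not optimized)."""
    def fit(self, X, y):
        best_err = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                pred = (X[:, j] > t).astype(int)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, self.j, self.t = err, j, t
        return self
    def predict(self, X):
        return (X[:, self.j] > self.t).astype(int)

def bagging_fit_predict(X, y, X_test, n_models=25, seed=0):
    """Bagging: train each weak model on a bootstrap subset, then vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap subset
        model = DecisionStump().fit(X[idx], y[idx])
        votes += model.predict(X_test)
    return (votes > n_models / 2).astype(int)        # majority vote

# Toy data: the class depends on whether feature 0 exceeds 0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)); y = (X[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 2))
print(bagging_fit_predict(X, y, X_test))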
3.2. Unsupervised learning

VIP Cheatsheet: Unsupervised Learning, September 9, 2018

Expectation-Maximization

• M-step: Use the posterior probabilities Qi(z(i)) as cluster-specific weights on the data points x(i) to separately re-estimate each cluster model.

Hierarchical clustering

• Linkage criteria:
  - Ward linkage: minimize the within-cluster distance.
  - Average linkage: minimize the average distance between cluster pairs.
  - Complete linkage: minimize the maximum distance between cluster pairs.

Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground-truth labels, as was the case in the supervised learning setting.

• Silhouette coefficient – Noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
  s = (b − a) / max(a, b)

• Calinski-Harabasz index – Noting k the number of clusters, the between- and within-clustering dispersion matrices Bk and Wk are defined as:
  Bk = Σ_{j=1..k} n_{c(i)} (µ_{c(i)} − µ)(µ_{c(i)} − µ)^T
  Wk = Σ_{i=1..m} (x(i) − µ_{c(i)})(x(i) − µ_{c(i)})^T
The Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters; the higher the score, the denser and better separated the clusters are. It is defined as follows:
  s(k) = (Tr(Bk) / Tr(Wk)) × ((N − k) / (k − 1))

Principal component analysis

It is a dimension-reduction technique that finds the variance-maximizing directions onto which to project the data.

• Eigenvalue, eigenvector – Given a matrix A ∈ R^(n×n), λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called an eigenvector, such that Az = λz.

• PCA procedure:
  - Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:
      x_j(i) ← (x_j(i) − µ_j) / σ_j, where µ_j = (1/m) Σ_{i=1..m} x_j(i) and σ_j² = (1/m) Σ_{i=1..m} (x_j(i) − µ_j)²
  - Step 2: Compute Σ = (1/m) Σ_{i=1..m} x(i) x(i)^T ∈ R^(n×n), which is symmetric with real eigenvalues.
  - Step 3: Compute u1, ..., uk ∈ R^n, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.
  - Step 4: Project the data on span_R(u1, ..., uk). This procedure maximizes the variance among all k-dimensional spaces.

Independent component analysis

It is a technique meant to find the underlying generating sources.

• Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s1, ..., sn), where the si are independent random variables, via a mixing and non-singular matrix A as follows:
  x = As
The goal is to find the unmixing matrix W = A^(−1) by an update rule.

• Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by writing the probability of x = As = W^(−1)s, deriving the log-likelihood, and then applying stochastic gradient ascent: for each training example x(i), W is updated as follows:
  W ← W + α ( [1 − 2g(w1^T x(i)); 1 − 2g(w2^T x(i)); ...; 1 − 2g(wn^T x(i))] x(i)^T + (W^T)^(−1) )
[Figures 1 and 2: PCA illustration – the original features F1 and F2 are rotated onto new features aligned with the directions of maximum variance, and the data is projected onto the new feature #1.]
Source: https://www.cheatsheets.aqeel-anwar.com
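A minimal NumPy sketch of the four PCA steps listed above; the data matrix is an arbitrary illustrative example.

import numpy as np

def pca(X, k):
    """Project X (m samples x n features) onto its top-k principal components."""
    # Step 1: normalize each feature to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: empirical covariance matrix Σ = (1/m) Σ x(i) x(i)^T
    Sigma = X.T @ X / X.shape[0]
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :k]
    # Step 4: project the data onto span(u1, ..., uk)
    return X @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # make two features correlated
print(pca(X, k=2).shape)  # (200, 2)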
4. Deep learning
CS 229 – Machine Learning (https://stanford.edu/~shervine)
VIP Cheatsheet: Deep Learning

Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.

• Architecture – The vocabulary around neural network architectures describes each unit in terms of its weights w, bias b and output z.

• Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. The most common ones are the sigmoid, tanh and ReLU functions.

• Learning rate – The learning rate, often noted η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which adapts the learning rate.

• Updating weights – In a neural network, weights are updated as follows:
  w ← w − η ∂L(z,y)/∂w

• Dropout – Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1 − p.

• Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z, y) is commonly used and is defined as follows:
  L(z, y) = −[ y log(z) + (1 − y) log(1 − z) ]

• Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i} using the batch mean µ_B and variance σ_B² as follows:
  x_i ← γ (x_i − µ_B) / √(σ_B² + ε) + β
It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.

• Convolution output size – For an input of size W, a filter of size F, padding P and stride S, the output feature map size N is:
  N = (W − F + 2P)/S + 1
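A tiny helper illustrating the output-size formula above; the example input, filter, padding and stride values are arbitrary.

def conv_output_size(W, F, P, S):
    """Output spatial size of a convolution: N = (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# A 224x224 input with a 3x3 filter, padding 1, stride 1 keeps the spatial size
print(conv_output_size(W=224, F=3, P=1, S=1))   # 224
# The same input with a 7x7 filter, padding 3, stride 2 halves it
print(conv_output_size(W=224, F=7, P=3, S=2))   # 112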
• Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:
  - Input gate: write to cell or not?
  - Forget gate: erase a cell or not?
  - Output gate: reveal a cell or not?
  - Gate: how much writing?

• LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.

Reinforcement Learning and Control

The goal of reinforcement learning is for an agent to learn how to evolve in an environment.

• Markov decision processes – A Markov decision process (MDP) is a 5-tuple (S, A, {Psa}, γ, R), where S is the set of states, A the set of actions, {Psa} the state transition probabilities, γ the discount factor and R the reward function.

• Value function – For a given policy π and a given state s, we define the value function V^π as follows:
  V^π(s) = E[ R(s0) + γR(s1) + γ²R(s2) + ... | s0 = s, π ]

• Bellman equation – The optimal Bellman equations characterize the value function V^π* of the optimal policy π*:
  V^π*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} Psa(s') V^π*(s')
Remark: the optimal policy π* for a given state s is such that:
  π*(s) = argmax_{a∈A} Σ_{s'∈S} Psa(s') V*(s')

• Value iteration algorithm – It is done in two steps:
  - We initialize the value: V0(s) = 0.
  - We iterate the value based on the values before:
      V_{i+1}(s) = R(s) + max_{a∈A} [ γ Σ_{s'∈S} Psa(s') V_i(s') ]

• Maximum likelihood estimate – The maximum likelihood estimates for the state transition probabilities are as follows:
  Psa(s') = (#times took action a in state s and got to s') / (#times took action a in state s)

• Q-learning – Q-learning is a model-free estimation of Q, the expected return of taking action a in state s; the estimate is updated from observed transitions (s, a, s') using the rule Q(s,a) ← Q(s,a) + α[ R + γ max_{a'} Q(s',a') − Q(s,a) ], without requiring an explicit model of Psa.
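A minimal NumPy sketch of the value-iteration update above, run on a tiny hypothetical 3-state MDP; the transition probabilities, rewards and discount factor are made up purely for illustration.

import numpy as np

# Hypothetical MDP: 3 states, 2 actions; P[a, s, s'] are transition probabilities
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])  # action 1
R = np.array([0.0, 0.0, 1.0])    # reward depends on the state only
gamma = 0.9

V = np.zeros(3)                  # V0(s) = 0
for _ in range(100):
    # V_{i+1}(s) = R(s) + max_a γ Σ_{s'} P_sa(s') V_i(s')
    V = R + gamma * np.max(P @ V, axis=0)

pi = np.argmax(P @ V, axis=0)    # greedy policy with respect to V
print("V*:", np.round(V, 3), "π*:", pi)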
Convolutional Neural Networks

• Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.

• Padding modes – The three common padding strategies are:
  - Valid: no padding; drops the last convolution if the dimensions do not match.
  - Same: padding such that the output feature map has size ⌈I/S⌉; the output size is mathematically convenient; also called 'half' padding.
  - Full: maximum padding such that the end convolutions are applied on the limits of the input; the filter 'sees' the input end-to-end.

Remark: the application of K filters of size F × F results in an output feature map of size O × O × K.

• Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.

Remark: often Pstart = Pend = P, in which case Pstart + Pend can be replaced by 2P in the output-size formula above.
• Understanding the complexity of the model – In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have; in a given layer of a convolutional neural network, this follows from the filter size and number of filters (CONV), is zero (POOL), or is the product of the input and output sizes plus the biases (FC).

• Commonly used activation functions in CNNs:
  - ReLU: g(z) = max(0, z)
  - Leaky ReLU: g(z) = max(εz, z), with ε ≪ 1
  - ELU: g(z) = max(α(e^z − 1), z), with α ≪ 1

• Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input that each pixel of the k-th activation map can 'see'. Calling Fj the filter size of layer j and Si the stride value of layer i, with the convention S0 = 1, the receptive field at layer k can be computed with the formula:
  Rk = 1 + Σ_{j=1..k} (Fj − 1) Π_{i=0..j−1} Si
For example, with F1 = F2 = 3 and S1 = S2 = 1, we get R2 = 1 + 2·1 + 2·1 = 5.

1.6 Object detection

• Types of models – There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different: image classification, classification with localization, and detection.
• YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:
  - Step 1: Divide the input image into a G × G grid.
  - Step 2: For each grid cell, run a CNN that predicts y of the following form:
      y = [pc, bx, by, bh, bw, c1, c2, ..., cp, ...]^T ∈ R^(G×G×k×(5+p)),  repeated k times,
    where pc is the probability of detecting an object, (bx, by) is the center and bh, bw the height and width of the detected bounding box, c1, ..., cp is a one-hot representation of which of the p classes was detected, and k is the number of anchor boxes. (For landmark detection, the prediction instead contains reference points (l1x, l1y), ..., (lnx, lny).)
  - Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.
Remark: when pc = 0, the network does not detect any object; in that case, the corresponding predictions bx, ..., cp have to be ignored.

• Intersection over Union – Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:
  IoU(Bp, Ba) = |Bp ∩ Ba| / |Bp ∪ Ba|
Remark: we always have IoU ∈ [0, 1]. By convention, a predicted bounding box Bp is considered reasonably good if IoU(Bp, Ba) ≥ 0.5.

• Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.

• R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.

• Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes with a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining: select the box with the highest probability, and discard the remaining boxes that overlap it too much.
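A minimal sketch of the IoU computation defined above, for axis-aligned boxes given as (x1, y1, x2, y2) corners; this corner-based box format is an assumption made for the example.

def iou(box_p, box_a):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_p[0], box_a[0]); y1 = max(box_p[1], box_a[1])
    x2 = min(box_p[2], box_a[2]); y2 = min(box_p[3], box_a[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143, below the 0.5 threshold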
Face verification and recognition

• Types of models – Two main types of model are summed up below: face verification (is this the claimed person? a one-to-one lookup) and face recognition (is this one of the persons in the database? a one-to-many lookup).

• One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1, image 2).

• Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted f(x(i)).

• Triplet loss – The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. Calling α ∈ R+ the margin parameter, this loss is defined as:
  ℓ(A, P, N) = max( d(A, P) − d(A, N) + α, 0 )

Neural style transfer

• Content cost function – The content cost function Jcontent(C, G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:
  Jcontent(C, G) = (1/2) ||a^[l](C) − a^[l](G)||²

• Style matrix – The style matrix G^[l] of a given layer l is a Gram matrix where each of its elements G^[l]_{kk'} quantifies how correlated the channels k and k' are. It is defined with respect to activations a^[l] as follows:
  G^[l]_{kk'} = Σ_{i=1..nH} Σ_{j=1..nw} a^[l]_{ijk} a^[l]_{ijk'}
Remark: the style matrices for the style image and the generated image are noted G^[l](S) and G^[l](G) respectively.

• Style cost function – The style cost function Jstyle(S, G) is used to determine how the generated image G differs from the style image S; it compares the style matrices G^[l](S) and G^[l](G) of the two images.

• Overall cost function – The overall cost function is defined as a combination of the content and style cost functions, weighted by parameters α, β:
  J(G) = α Jcontent(C, G) + β Jstyle(S, G)
Remark: a higher value of α will make the model care more about the content, while a higher value of β will make it care more about the style.
• ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:
  a^[l+2] = g(a^[l] + z^[l+2])

• Inception Network – This architecture uses inception modules and aims at trying different convolutions in order to increase its performance. In particular, it uses the 1 × 1 convolution trick to lower the burden of computation.

Remark: use cases of GAN (Generative Adversarial Network) variants include text-to-image, music generation and synthesis.

2.1 Overview

• Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. For each timestep t, the activation a<t> and the output y<t> are expressed as follows:
  a<t> = g1(Waa a<t−1> + Wax x<t> + ba)   and   y<t> = g2(Wya a<t> + by)
where Wax, Waa, Wya, ba, by are coefficients that are shared temporally, and g1, g2 are activation functions.

The pros and cons of a typical RNN architecture are summed up in the table below:
  Advantages:
    - Possibility of processing input of any length
    - Model size not increasing with the size of the input
    - Computation takes into account historical information
    - Weights are shared across time
  Drawbacks:
    - Computation is slow
    - Difficulty accessing information from a long time ago
    - Cannot consider any future input for the current state
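A minimal NumPy sketch of the RNN recurrence given above; the dimensions and the choices of tanh and softmax for g1 and g2 are illustrative assumptions.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_forward(x_seq, Waa, Wax, Wya, ba, by):
    """Vanilla RNN over a sequence: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)."""
    a = np.zeros(Waa.shape[0])
    outputs = []
    for x_t in x_seq:
        a = np.tanh(Waa @ a + Wax @ x_t + ba)      # hidden state update
        outputs.append(softmax(Wya @ a + by))       # y<t> = g2(Wya a<t> + by)
    return outputs

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 4, 8, 3, 5                       # input, hidden, output sizes; length
params = (rng.normal(size=(n_a, n_a)) * 0.1, rng.normal(size=(n_a, n_x)) * 0.1,
          rng.normal(size=(n_y, n_a)) * 0.1, np.zeros(n_a), np.zeros(n_y))
x_seq = rng.normal(size=(T, n_x))
print(rnn_forward(x_seq, *params)[-1])              # output distribution at t = T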
• Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:
  - Many-to-one (Tx > 1, Ty = 1): sentiment classification
  - Many-to-many (Tx = Ty): named entity recognition
  - Many-to-many (Tx ≠ Ty): machine translation

• Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:
  L(ŷ, y) = Σ_{t=1..Ty} L(ŷ<t>, y<t>)

• Backpropagation through time – Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to the weight matrix W is obtained by summing the contributions of all time steps.

• Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason they happen is that it is difficult to capture long-term dependencies because of multiplicative gradients that can be exponentially decreasing/increasing with respect to the number of layers.

• Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value of the gradient, this phenomenon is controlled in practice.

• Commonly used activation functions – The sigmoid, tanh and ReLU activations are:
  g(z) = 1/(1 + e^(−z)),   g(z) = (e^z − e^(−z))/(e^z + e^(−z)),   g(z) = max(0, z)

• Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:
  Γ = σ(W x<t> + U a<t−1> + b)
where W, U, b are coefficients specific to the gate and σ is the sigmoid function.
Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.

2.3 Word representation

• Representation techniques – The two main ways of representing words are:
  - 1-hot representation, noted ow: naive approach, no similarity information.
  - Word embedding, noted ew: takes word similarity into account.

• Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:
  ew = E ow
Remark: learning the embedding matrix can be done using target/context likelihood models.
• Variants of RNNs – Other commonly used RNN architectures include the Bidirectional RNN (BRNN) and the Deep RNN (DRNN).

2.3.2 Word embeddings

• Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

• Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. Noting θt a parameter associated with t, the probability P(t|c) is given by:
  P(t|c) = exp(θt^T ec) / Σ_{j=1..|V|} exp(θj^T ec)
Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model that uses the surrounding words to predict a given word.

• Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how likely a given context and a given target word are to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:
  P(y = 1|c, t) = σ(θt^T ec)
Remark: this method is less computationally expensive than the skip-gram model.

• GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times a target i occurred with a context j. Its cost function J is as follows:
  J(θ) = (1/2) Σ_{i,j=1..|V|} f(Xij) (θi^T ej + bi + b'j − log(Xij))²
where f is a weighting function such that Xi,j = 0 ⟹ f(Xi,j) = 0.
Given the symmetry that e and θ play in this model, the final word embedding ew^(final) is given by:
  ew^(final) = (ew + θw) / 2
Remark: the individual components of the learned word embeddings are not necessarily interpretable.

2.4 Comparing words

• Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:
  similarity = (w1 · w2) / (||w1|| ||w2||) = cos(θ)

2.5 Language model

• Overview – A language model aims at estimating the probability of a sentence P(y).

• n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting the number of its appearances in the training data.

• Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better, and is defined as follows:
  PP = Π_{t=1..T} ( 1 / Σ_{j=1..|V|} y_j(t) · ŷ_j(t) )^(1/T)
Remark: PP is commonly used in t-SNE.

2.6 Machine translation

• Overview – A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:
  y = arg max_{y<1>,...,y<Ty>} P(y<1>, ..., y<Ty> | x)

• Beam search – It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x:
  - Step 1: Find the top B likely words y<1>.
  - Step 2: Compute the conditional probabilities y<k>|x, y<1>, ..., y<k−1>.
  - Step 3: Keep the top B combinations and repeat from Step 2.
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
• Beam width – The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory; small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.

• Length normalization – In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:
  Objective = (1/Ty^α) Σ_{t=1..Ty} log p(y<t> | x, y<1>, ..., y<t−1>)
Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.

• Error analysis – When obtaining a predicted translation ŷ that is bad, one can wonder why we did not get a good translation y* by performing the following error analysis:
  - If P(y*|x) > P(ŷ|x): beam search is faulty; remedy: increase the beam width.
  - If P(y*|x) ≤ P(ŷ|x): the RNN is faulty; remedies: try a different architecture, regularize, get more data.

• Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:
  bleu score = exp( (1/n) Σ_{k=1..n} p_k )

2.7 Attention

• Attention model – This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. Noting α<t,t'> the amount of attention that the output y<t> should pay to the activation a<t'>, and c<t> the context at time t, we have:
  c<t> = Σ_{t'} α<t,t'> a<t'>    with    Σ_{t'} α<t,t'> = 1

• Attention weight – The amount of attention that the output y<t> should pay to the activation a<t'> is given by α<t,t'>, computed as follows:
  α<t,t'> = exp(e<t,t'>) / Σ_{t''=1..Tx} exp(e<t,t''>)
Remark: the computation complexity is quadratic with respect to Tx. The attention scores are commonly used in image captioning and machine translation.
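A minimal NumPy sketch of the attention-weight softmax and context computation above; treating the scores e<t,t'> as dot products between a decoder query and the encoder activations is one common choice, assumed here for illustration.

import numpy as np

def attention_context(scores, activations):
    """Turn raw scores e<t,t'> into weights α<t,t'> and a context vector c<t>."""
    e = np.exp(scores - scores.max())        # numerically stable softmax
    alpha = e / e.sum()                      # α<t,t'> sums to 1 over t'
    return alpha @ activations               # c<t> = Σ_t' α<t,t'> a<t'>

rng = np.random.default_rng(0)
Tx, n_a = 6, 4
a = rng.normal(size=(Tx, n_a))               # encoder activations a<t'>
query = rng.normal(size=n_a)                 # decoder state at time t
scores = a @ query                           # dot-product scores e<t,t'>
print(attention_context(scores, a))          # context vector c<t> of size n_a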
Deep Learning Tips and Tricks

• Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing data using data augmentation techniques. The main ones are summed up below; given an input image, we can apply:
  - Original: the image without any modification.
  - Flip: the image flipped with respect to an axis for which the meaning of the image is preserved.
  - Rotation: rotation with a slight angle; simulates incorrect horizon calibration.
  - Random crop: random focus on one part of the image; several random crops can be done in a row.

• Epoch – In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.

• Mini-batch gradient descent – During the training phase, updating weights is usually not based on the whole training set at once, due to computation complexity, nor on one data point, due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.

• Loss function – In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.

• Cross-entropy loss – In the context of binary classification in neural networks, the cross-entropy loss L(z, y) is commonly used and is defined as follows:
  L(z, y) = −[ y log(z) + (1 − y) log(1 − z) ]

3.2.2 Finding optimal weights

• Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.

• Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i} using the batch mean µ_B and variance σ_B²:
  x_i ← γ (x_i − µ_B) / √(σ_B² + ε) + β
It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.
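A minimal sketch tying together the binary cross-entropy loss and mini-batch gradient descent described above, on a toy logistic model; the learning rate, batch size and data are illustrative assumptions.

import numpy as np

def binary_cross_entropy(z, y):
    # L(z, y) = -[y log(z) + (1 - y) log(1 - z)], averaged over the batch
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=20, seed=0):
    """Logistic model trained with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                          # one epoch = one pass over the data
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):   # one update per mini-batch
            idx = order[start:start + batch_size]
            z = 1 / (1 + np.exp(-X[idx] @ w))
            w -= lr * X[idx].T @ (z - y[idx]) / len(idx)
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (X[:, 1] > 0).astype(float)
w = minibatch_sgd(X, y)
z = 1 / (1 + np.exp(-X @ w))
print("final training loss:", round(binary_cross_entropy(z, y), 3))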
• Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.

• Transfer learning – Training a deep learning model requires a lot of data and, more importantly, a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage them towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:
  - Small training set: freeze all layers, train the weights on the softmax layer only.
  - Medium training set: freeze most layers, train the weights on the last layers and the softmax.
  - Large training set: train the weights on the layers and the softmax, initializing the weights on the pre-trained ones.

• Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which adapts the learning rate.

• Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summed up below:
  - RMSprop (Root Mean Square propagation): speeds up the learning algorithm by controlling oscillations; updates w ← w − α dw/√s_dw and b ← b − α db/√s_db.
  - Adam (Adaptive Moment estimation): the most popular method, with 4 parameters to tune; updates w ← w − α v_dw/(√s_dw + ε) and b ← b − α v_db/(√s_db + ε).

• Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > 0. It forces the model to avoid relying too much on particular sets of features.
Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1 − p.

• Weight regularization – In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up below:
  - LASSO (L1): adds λ||θ||1 to the cost function, with λ ∈ R.
  - Ridge (L2): adds λ||θ||2² to the cost function, with λ ∈ R.
  - Elastic Net: adds λ[(1 − α)||θ||1 + α||θ||2²], with λ ∈ R and α ∈ [0,1].

• Early stopping – This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.
CNN Template:
Most of the commonly used hidden layers (not all) follow a pattern.
1. Layer function: basic transforming function such as a convolutional or fully connected layer.
  a. Fully Connected: linear functions between the input and the output.
  b. Convolutional Layers: these layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D) kernel/filter that moves across the input feature map, generating dot products with the overlapping region of the input feature map.
  c. Transposed Convolutional (DeConvolutional) Layer: usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer.
[Figure: a fully connected layer computes weighted sums such as y1 = w11·x1 + w21·x2 + w31·x3 + b1, while a convolutional layer slides a small kernel over the input feature map.]

VGGNet – 2014
Why: VGGNet was born out of the need to reduce the number of parameters in the CONV layers and improve on training time.
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.).
How: The important point to note here is that all the conv kernels are of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.

ResNet – 2015
Why: Neural networks are notorious for not being able to find a simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures, where 'XX' denotes the number of layers. The most used ones are ResNet50 and ResNet101. Since the vanishing gradient problem was taken care of (more about it in the How part), CNNs started to get deeper and deeper.
How: The ResNet architecture makes use of shortcut connections to solve the vanishing gradient problem. The basic building block of ResNet is a Residual block that is repeated throughout the network: a weight layer computes f(x) and the shortcut adds the input back, giving f(x) + x.
[Figure: an Inception module concatenates the outputs of parallel 1x1, 3x3 and 5x5 convolutions applied to the previous layer.]
6. Programming languages
6.1. Python language

Python 3 Beginner's Reference Cheat Sheet – Alvaro Sebastian (http://www.sixthresearcher.com)
Legend: x, y stand for any kind of data values, s for a string, n for a number, L for a list where i, j are list indexes, D stands for a dictionary and k is a dictionary key.
Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries
>>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(my_array, other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> np.corrcoef(my_array) Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file:
pickled_data = pickle.load(file)
>>> df_sheet2 = data.parse('1960-1966',
Learn Python for data science Interactively at www.DataCamp.com skiprows=[0],
names=['Country',
'AAM: War(2002)'])
>>> df_sheet1 = data.parse(0, HDF5 Files
parse_cols=[0],
Importing Data in Python skiprows=[0], >>> import h5py
>>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
names=['Country'])
Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r')
your data: To access the sheet names, use the sheet_names attribute:
>>> import numpy as np >>> data.sheet_names
>>> import pandas as pd Matlab Files
Help SAS Files >>> import scipy.io
>>> filename = 'workspace.mat'
>>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename)
>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file:
df_sas = file.to_data_frame()
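To complement the NumPy/pandas imports above, here is a minimal sketch of reading a numeric flat file and a CSV; the file names are hypothetical placeholders.

import numpy as np
import pandas as pd

# hypothetical file names - replace with your own data
arr = np.loadtxt("data.txt", delimiter=",", skiprows=1)   # numeric flat file into an array
df = pd.read_csv("data.csv", header=0, nrows=5)           # first 5 rows of a CSV into a DataFrame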
1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)
Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
>>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'],
palette=['blue', 'red', 'green'])
4 Output & Export
3. Add renderers for your data, with visual customizations >>> p3.circle('mpg', 'cyl', source=cds_df, Notebook
color=dict(field='origin',
4. Specify where to generate the output transform=color_mapper), >>> from bokeh.io import output_notebook, show
5. Show or save the results legend='Origin') >>> output_notebook()
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Legend Location HTML
>>> x = [1, 2, 3, 4, 5] Step 1
>>> y = [6, 7, 2, 4, 5] Inside Plot Area Standalone HTML
>>> p = figure(title="simple line example", Step 2 >>> p.legend.location = 'bottom_left' >>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
x_axis_label='x',
>>> html = file_html(p, CDN, "my_plot")
y_axis_label='y') Outside Plot Area
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 >>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]) >>> from bokeh.io import output_file, show
>>> output_file("lines.html") Step 4 >>> r2 = p2.line([1,2,3,4], [3,4,5,6]) >>> output_file('my_bar_chart.html', mode='cdn')
>>> show(p) Step 5 >>> legend = Legend(items=[("One" ,[p1, r1]),("Two",[r2])],
location=(0, -30)) Components
1 Data Also see Lists, NumPy & Pandas
>>> p.add_layout(legend, 'right')
Legend Orientation
>>> from bokeh.embed import components
>>> script, div = components(p)
Under the hood, your data is converted to Column Data
Sources. You can also do this manually: >>> p.legend.orientation = "horizontal" PNG
>>> import numpy as np >>> p.legend.orientation = "vertical"
>>> from bokeh.io import export_png
>>> import pandas as pd >>> export_png(p, filename="plot.png")
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Legend Background & Border
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]), >>> p.legend.border_line_color = "navy" SVG
columns=['mpg','cyl', 'hp', 'origin'], >>> p.legend.background_fill_color = "white"
index=['Toyota', 'Fiat', 'Volvo']) >>> from bokeh.io import export_svgs
>>> from bokeh.models import ColumnDataSource Rows & Columns Layout >>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
>>> cds_df = ColumnDataSource(df) Rows
>>> from bokeh.layouts import row
>>>
input_dim=100))
model.add(Dense(1, activation='sigmoid'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
you split the data into training and test sets, for which you can use the batch_size=32)
>>> model2.add(Dropout(0.25))
train_test_split function from sklearn.model_selection.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(64,(3, 3)))
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>>
>>>
model2.add(Flatten())
model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('my_model.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.layers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2) DataCamp
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2) Learn Python for Data Science Interactively
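Because the columns of the Keras sheet above are interleaved by the extraction, here is a consolidated minimal sketch of the workflow it describes (define, compile, fit, evaluate, predict), using small random toy data rather than the sheet's datasets:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

data = np.random.rand(1000, 100).astype("float32")    # toy inputs
labels = np.random.randint(0, 2, size=(1000, 1))       # toy binary targets

model = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(data, labels, epochs=10, batch_size=32, verbose=0)
score = model.evaluate(data, labels, batch_size=32, verbose=0)
predictions = model.predict(data)
print(score, predictions.shape)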
Working with Different Programming Languages Widgets
Python For Data Science Cheat Sheet Kernels provide computation and communication with front-end interfaces Notebook widgets provide the ability to visualize and control changes
Jupyter Notebook like the notebooks. There are three main kernels: in your data, often as a control like a slider, textbox, etc.
Learn More Python for Data Science Interactively at www.DataCamp.com
You can use them to build interactive GUIs for your notebooks or to
IRkernel IJulia
synchronize stateful and stateless information between Python and
Installing Jupyter Notebook will automatically install the IPython kernel. JavaScript.
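As a small illustration of the widget idea described above (assuming the ipywidgets package is installed in the notebook environment), interact() turns a function argument into a slider:

from ipywidgets import interact

def show_square(n=5):
    print(n, "squared is", n * n)

interact(show_square, n=(0, 10))   # renders an integer slider; moving it re-runs the function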
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF
##⌴Header 2 → Header 2
Header 2
-------- → Header 2
###⌴Header 3 → Header 3
[Link](https://sqlbak.com "optional title") → Link
Click [here][id] with [id]:https://sqlbak.com → reference-style link
list(text) Split text into character tokens g=nltk.CFG.fromstring("""...""") Manually define grammar
trees=parser.parse_all(text)
Accessing corpora and lexical resources
for tree in trees: ... print(tree)
from nltk.corpus import import CorpusReader object
brown from nltk.corpus import treebank
df=pd.DataFrame(time_sents, columns=['text'])
df['text'].str.split().str.len()
df['text'].str.contains('word')
df['text'].str.count(r'\d')
df['text'].str.findall(r'\d')
df['text'].str.replace(r'\w+day\b', '???')
df['text'].str.extract(r'(\d?\d):(\d\d)')
df['text'].str.extractall(r'((\d?\d):(\d\d) ?
([ap]m))')
df['text'].str.extractall(r'(?P<digits>\d)')
756 plot([X],Y,[fmt],…) API ax.set_[xy]scale(scale,…) from matplotlib import ticker import matplotlib.animation as mpla
Cheat sheet 432 Version 3.5.0 X, Y, fmt, color, marker, linestyle linear
0.0 log
0.0 ax.[xy]axis.set_[minor|major]_locator(locator)
1 - + any values
2.5 0 + values > 0
2.510102101 0logit ticker.NullLocator() T = np.linspace(0, 2*np.pi, 100)
432
plt.gcf(), animate, interval=5)
height, width, bottom, align, color ticker.IndexLocator(base=0.5, offset=0.25)
plt.show()
1 subplot(…,projection=p) 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75
ticker.AutoLocator()
X = np.linspace(0, 2*np.pi, 100)
765 1234567 imshow(Z,…) API
p=’polar’ p=’3d’ 0 1 2 3 4 5
Styles API
3421
Y = np.cos(X) ticker.MaxNLocator(n=4)
Z, cmap, interpolation, extent, origin 0.0 1.5 3.0 4.5
fig, ax = plt.subplots() ticker.LogLocator(base=10, numticks=15) plt.style.use(style)
21
0.5 0.5
fig.savefig(“figure.pdf”)
Tick formatters
0.0 0.0 0.0
fig.show()
32 1234567 pcolormesh([X],[Y],Z,…)
1.0 1.0 1.0
0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6
Anatomy of a figure
4
linestyle or ls
1.0
0 1 2 3 4 5 6
1.0
0 1 2 3 4 5 6
1.0
0 1 2 3 4 5 6
1 capstyle or dash_capstyle
Legend ticker.FuncFormatter(lambda x, pos: "[%.2f]" % x) 0.0 0.0 0.0
432
3
>0< >1< >2< >3< >4< >5<
Major tick label Grid
Quick reminder
Line
(line plot) 1 Markers API 0
ticker.ScalarFormatter()
1 2 3 4 5
2
432 TEX y, text, va, ha, size, weight, transform 0.0 1.0 2.0 3.0 4.0 5.0 ax.patch.set_alpha(0)
1 ax.set_[xy]lim(vmin, vmax)
'.' 'o' 's' 'P' 'X' '*' 'p' 'D' '<' '>' '^' 'v' ticker.PercentFormatter(xmax=5)
Y axis label 0% 20% 40% 60% 80% 100% ax.set_[xy]label(label)
756 1234567 X,fill[_between][x](…)
Markers
(scatter plot)
API '1' '2' '3' '4' '+' 'x' '|' '_' 4 5 6 7 ax.set_[xy]ticks(list)
1 432 Y1, Y2, color, where
Ornaments
ax.set_[xy]ticklabels(list)
1 '$ $''$ $''$ $''$ $''$ $''$ $''$ $''$ $''$ $''$ $''$ $''$ $'
markevery
ax.set_[sup]title(title)
1234567
Spines ax.legend(…) API ax.tick_params(width=10, …)
Figure Line 10 [0, -1] (25, 5) [0, 25, -1] handles, labels, loc, title, frameon ax.set_axis_[on|off]()
Axes (line plot)
Advanced plots
0
765
title
0 0.25 0.50 0.75 1 1.25 1.50 1.75 2 2.25 2.50 2.75 3 3.25 3.50 3.75 4 fig.tight_layout()
X axis label step(X,Y,[fmt],…) API
Colors Legend
432
Minor tick label API plt.gcf(), plt.gca()
X axis label X, Y, fmt, color, marker, where handletextpad
label
handle
1 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 markerfacecolor (mfc) mpl.rc(’axes’, linewidth=1, …)
1
handlelength
’Cn’
Subplots layout
0 Label 1 Label 3 fig.patch.set_alpha(0)
765 1234567 X,boxplot(X,…)
API 0 b 2 g 4 r 6 c 8 m 10 y 12 k 14 w 16 ’x’
API 1 DarkRed Firebrick Crimson IndianRed Salmon labelspacing markeredgecolor (mec) text=r’$\frac{-e^{i\pi}}{2^n}$’
’name’
ax = d.new_horizontal(’10%’) 3
d=make_axes_locatable(ax) API
42 D, positions, widths, vert plasma
text, xy, xytext, xycoords, textcoords, arrowprops
1 Sequential
text Ten simple rules
765 1234567 barbs([X],[Y],
Greys READ
API U, V, …) YlOrBr xytext xy
4
textcoords xycoords
X, Y, U, V, C, length, pivot, sizes
231
Wistia 1. Know Your Audience
Getting help Diverging 2. Identify Your Message
fig, ax = plt.subplots()
5. Do Not Trust the Defaults
6. Use Color Effectively
W stackoverflow.com/questions/tagged/matplotlib
Ż gitter.im/matplotlib
7
654 1234567 hexbin(X,Y,C,…) API
X, Y, C, gridsize, bins
tab10
tab20
def on_click(event):
print(event)
7. Do Not Mislead the Reader
8. Avoid “Chartjunk”
F twitter.com/matplotlib
a Matplotlib users mailing list
321 Cyclic
twilight
fig.canvas.mpl_connect(
’button_press_event’, on_click)
9. Message Trumps Beauty
10. Get the Right Tool
1234567
Axes adjustments API Uniform colormaps Color names API Legend placement How do I …
plt.subplots_adjust( … ) viridis
plasma
black
k
dimgray
dimgrey
gray
floralwhite
darkgoldenrod
goldenrod
cornsilk
gold
darkturquoise
cadetblue
powderblue
lightblue
deepskyblue
L K J … resize a figure?
→ fig.set_size_inches(w, h)
… save a figure?
A 2 9 1 I
grey lemonchiffon skyblue
inferno
darkgray khaki lightskyblue → fig.savefig(”figure.pdf”)
darkgrey palegoldenrod steelblue
magma silver darkkhaki aliceblue … save a transparent figure?
top lightgray ivory dodgerblue
axes width cividis lightgrey beige lightslategray → fig.savefig(”figure.pdf”, transparent=True)
gainsboro lightyellow lightslategrey
whitesmoke lightgoldenrodyellow slategray … clear a figure/an axes?
w olive slategrey
→ fig.clear() → ax.clear()
Sequential colormaps white
snow
y
yellow
lightsteelblue
cornflowerblue
rosybrown olivedrab royalblue … close all figures?
B 6 10 7 H
figure height
lightcoral yellowgreen ghostwhite
axes height indianred darkolivegreen lavender → plt.close(”all”)
brown greenyellow midnightblue
Greys
firebrick chartreuse navy … remove ticks?
hspace Purples maroon lawngreen darkblue → ax.set_[xy]ticks([])
darkred honeydew mediumblue
Blues r darkseagreen b … remove tick labels ?
red palegreen blue
Greens mistyrose lightgreen slateblue → ax.set_[xy]ticklabels([])
salmon forestgreen darkslateblue
tomato limegreen mediumslateblue … rotate tick labels ?
C 3 8 4 G
left bottom wspace right Oranges
darksalmon darkgreen mediumpurple
Reds coral g rebeccapurple → ax.set_[xy]ticks(rotation=90)
orangered green blueviolet
YlOrBr lightsalmon lime indigo … hide top spine?
sienna seagreen darkorchid
figure width
D E F
YlOrRd seashell mediumseagreen darkviolet → ax.spines[’top’].set_visible(False)
chocolate springgreen mediumorchid
OrRd saddlebrown mintcream thistle … hide legend border?
sandybrown mediumspringgreen plum
peachpuff mediumaquamarine violet → ax.legend(frameon=False)
PuRd peru aquamarine purple
Extent & origin API linen turquoise darkmagenta ax.legend(loc=”string”, bbox_to_anchor=(x,y)) … show error as shaded region?
RdPu bisque lightseagreen m
darkorange mediumturquoise fuchsia
2: upper left 9: upper center 1: upper right → ax.fill_between(X, Y+error, Y‐error)
ax.imshow( extent=…, origin=… ) BuPu burlywood azure magenta
GnBu
antiquewhite
tan
lightcyan
paleturquoise
orchid
mediumvioletred 6: center left 10: center 7: center right … draw a rectangle?
origin="upper" origin="upper" PuBu
navajowhite
blanchedalmond
darkslategray
darkslategrey
deeppink
hotpink 3: lower left 8: lower center 4: lower right → ax.add_patch(plt.Rectangle((0, 0), 1, 1)
5
(0,0) (0,0)
YlGnBu
papayawhip
moccasin
teal
darkcyan
lavenderblush
palevioletred … draw a vertical line?
orange c crimson A: upper right / (-0.1,0.9) B: center right / (-0.1,0.5) → ax.axvline(x=0.5)
PuBuGn wheat aqua pink
oldlace cyan lightpink C: lower right / (-0.1,0.1) D: upper left / (0.1,-0.1) … draw outside frame?
BuGn E: upper center / (0.5,-0.1) F: upper right / (0.9,-0.1)
0
(4,4) (4,4) → ax.plot(…, clip_on=False)
extent=[0,10,0,5] extent=[10,0,0,5] YlGn G: lower left / (1.1,0.1) H: center left / (1.1,0.5)
Image interpolation API … use transparency?
origin="lower" origin="lower" I: upper left / (1.1,0.9) J: lower right / (0.9,1.1) → ax.plot(…, alpha=0.25)
5
(4,4) (4,4)
Diverging colormaps K: lower center / (0.5,1.1) L: lower left / (0.1,1.1) … convert an RGB image into a gray image?
→ gray = 0.2989*R + 0.5870*G + 0.1140*B
PiYG … set figure background color?
0
(0,0) (0,0) Annotation connection styles API → fig.patch.set_facecolor(“grey”)
extent=[0,10,0,5] extent=[10,0,0,5] PRGn
0 10 0 10 BrBG
… get a reversed colormap?
PuOr None none nearest arc3,
rad=0
arc3,
rad=0.3
angle3,
angleA=0, → plt.get_cmap(“viridis_r”)
… get a discrete colormap?
angleB=90
Text alignments
RdGy
API → plt.get_cmap(“viridis”, 10)
RdBu
… show a figure for one second?
ax.text( …, ha=… , va=…, …) RdYlBu
→ fig.show(block=False), time.sleep(1)
Matplotlib
RdYlGn
(1,1)
top
Performance tips
Spectral
Text parameters API bar, bar, bar, cla(), imshow(…), canvas.draw() slow
spline36 hanning hamming
fraction=0.3 fraction=-0.3 angle=180,
Pastel1 fraction=-0.2 im.set_data(…), canvas.draw() fast
ax.text(…, family=…, size=…, weight=…) Pastel2
ax.text(…, fontproperties=…) Paired Beyond Matplotlib
The quick brown fox
Accent
xx-large (1.73)
Dark2 Seaborn: Statistical Data Visualization
The quick brown fox x-large (1.44)
Set1 Cartopy: Geospatial Data Processing
The quick brown fox large (1.20) yt: Volumetric data Visualization
The quick brown fox
The quick brown fox
medium
small
(1.00)
(0.83)
Set2
Set3
hermite kaiser quadric Annotation arrow styles API mpld3: Bringing Matplotlib to the browser
The quick brown fox
The quick brown fox
x-small (0.69)
tab10 Datashader: Large data processing pipeline
xx-small (0.58)
plotnine: A Grammar of Graphics for Python
The quick brown fox jumps over the lazy dog black (900)
tab20
- <- ->
The quick brown fox jumps over the lazy dog bold (700) tab20b
The quick brown fox jumps over the lazy dog semibold (600) tab20c
The quick brown fox jumps over the lazy dog normal (400) <-> <|- -|> Matplotlib Cheatsheets
The quick brown fox jumps over the lazy dog Copyright (c) 2021 Matplotlib Development Team
ultralight (100)
Miscellaneous colormaps
catrom gaussian bessel Released under a CC‐BY 4.0 International License
The quick brown fox jumps over the lazy dog monospace
The quick brown fox jumps over the lazy dog serif
<|-|> ]- -[
The quick brown fox jumps over the lazy dog sans
The quick brown fox jumps over the lazy dog cursive
terrain
ocean ]-[ |-| ]->
The quick brown fox jumps over the lazy dog italic
The quick brown fox jumps over the lazy dog normal cubehelix
rainbow <-[ simple fancy
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
small-caps
normal twilight mitchell sinc lanczos
Matplotlib for beginners
Matplotlib is a library for making 2D plots in Python. It is
Z = np.random.uniform(0, 1, (8,8)) 765 Organize
designed with the philosophy that you should be able to
create simple plots with just a few commands: 432 You can plot several data on the the same figure, but you
ax.contourf(Z)
1 can also split a figure in several subplots (named Axes):
1 Initialize
Z = np.random.uniform(0, 1, 4) 765 1234567 765
432
X = np.linspace(0, 10, 100)
import numpy as np Y1, Y2 = np.sin(X), np.cos(X) 432
import matplotlib.pyplot as plt ax.pie(Z)
1 ax.plot(X, Y1, X, Y2)
1
Z = np.random.normal(0, 1, 100) 71
61
51 1234567 1234567
2 Prepare
41
31
21
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(X, Y1, color=”C1”)
X = np.linspace(0, 4*np.pi, 1000) ax.hist(Z)
111 ax2.plot(X, Y2, color=”C0”)
Y = np.sin(X)
X = np.arange(5) 765 1234567
432
fig, (ax1, ax2) = plt.subplots(1, 2)
3 Render Y = np.random.uniform(0, 1, 5) ax1.plot(Y1, X, color=”C1”)
fig, ax = plt.subplots()
ax.errorbar(X, Y, Y/4)
1 ax2.plot(Y2, X, color=”C0”)
21
1.0 fig.suptitle(None)
0.5 Tweak ax.set_title(”A Sine wave”)
1234567
0.0
0.5 You can modify pretty much anything in a plot, including lim-
ax.plot(X, Y)
1.0 its, colors, markers, line width and styles, ticks and ticks la-
0 5 10 15 20 25 30 ax.set_ylabel(None)
bels, titles, etc.
ax.set_xlabel(”Time”)
765
Time
Choose X = np.linspace(0, 10, 100)
Y = np.sin(X) 432 Explore
Matplotlib offers several kind of plots (see Gallery): ax.plot(X, Y, color=”black”)
1 Figures are shown with a graphical user interface that al-
X = np.random.uniform(0, 1, 100) 765 X = np.linspace(0, 10, 100) 765 1234567 lows to zoom and pan the figure, to navigate between the
Y = np.random.uniform(0, 1, 100) 432 Y = np.sin(X) 432 different views and to show the value under the mouse.
ax.scatter(X, Y)
1 ax.plot(X, Y, linestyle=”--”)
1
X = np.arange(10) 765 1234567 X = np.linspace(0, 10, 100) 765 1234567 Save (bitmap or vector format)
Y = np.random.uniform(1, 10, 10) 432 Y = np.sin(X) 432
ax.bar(X, Y)
1 ax.plot(X, Y, linewidth=5)
1 fig.savefig(”my-first-figure.png”, dpi=300)
fig.savefig(”my-first-figure.pdf”)
Z = np.random.uniform(0, 1, (8,8)) 765 1234567 X = np.linspace(0, 10, 100) 765 1234567
432 Y = np.sin(X) 432
1 1
Matplotlib 3.5.0 handout for beginners. Copyright (c) 2021 Matplotlib Development
ax.imshow(Z) ax.plot(X, Y, marker=”o”) Team. Released under a CC-BY 4.0 International License. Supported by NumFOCUS.
1234567 1234567
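The multi-column layout of the beginner handout is scrambled by the extraction above, so here is a consolidated minimal sketch of the workflow it walks through (initialize, prepare data, render, tweak, save):

import numpy as np
import matplotlib.pyplot as plt

# prepare some data
X = np.linspace(0, 4 * np.pi, 1000)
Y = np.sin(X)

# render
fig, ax = plt.subplots()
ax.plot(X, Y)

# tweak: title and axis label
ax.set_title("A Sine wave")
ax.set_xlabel("Time")

# save in bitmap or vector format
fig.savefig("my-first-figure.png", dpi=300)
fig.savefig("my-first-figure.pdf")
plt.show()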
Matplotlib for intermediate users
A matplotlib figure is composed of a hierarchy of elements Ticks & labels Legend
that forms the actual figure. Each element can be modified.
from matplotlib.ticker import MultipleLocator as ML ax.plot(X, np.sin(X), "C0", label="Sine")
from matplotlib.ticker import ScalarFormatter as SF ax.plot(X, np.cos(X), "C1", label="Cosine")
4
Anatomy of a figure ax.xaxis.set_minor_locator(ML(0.2)) ax.legend(bbox_to_anchor=(0,1,1,.1),ncol=2,
Title Blue signal ax.xaxis.set_minor_formatter(SF()) mode=”expand”, loc=”lower left”)
Major tick Red signal ax.tick_params(axis=’x’,which=’minor’,rotation=90)
Legend Sine Sine and Cosine Cosine
Minor tick 0 1 2 3 4 5
0.2
0.4
0.6
0.8
1.2
1.4
1.6
1.8
2.2
2.4
2.6
2.8
3.2
3.4
3.6
3.8
4.2
4.4
4.6
4.8
3
Major tick label Grid Lines & markers
Line
(line plot)
X = np.linspace(0.1, 10*np.pi, 1000) Annotation
Y axis label
2 Y = np.sin(X)
ax.plot(X, Y, ”C1o:”, markevery=25, mec=”1.0”) ax.annotate(”A”, (X[250],Y[250]),(X[250],-1),
Y axis label Markers ha=”center”, va=”center”,arrowprops =
(scatter plot) 1 {”arrowstyle” : ”->”, ”color”: ”C1”})
0
1 1 1
0 5 10 15 20 25 30 0
Spines 1 A
Figure Line 0 5 10 15 20 25 30
Axes (line plot) Scales & projections
0
0 0.25 0.50 0.75 1 1.25 1.50 1.75 2 2.25 2.50 2.75 3 3.25 3.50 3.75 4 Colors
X axis label fig, ax = plt.subplots()
Minor tick label
X axis label ax.set_xscale(”log”)
ax.plot(X, Y, "C1o-", markevery=25, mec="1.0")
Any color can be used, but Matplotlib offers predefined sets of colors:
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
Figure, axes & spines
1 0 2 4 6 8 10 12 14 16
0 0
1 0 2 4 6 8 10 12 14 16
fig, axs = plt.subplots(3, 3)
Size & DPI
axs[0,0].set_facecolor(”#ddddff”)
axs[2,2].set_facecolor(”#ffffdd”) Text & ornaments Consider a square figure to be included in a two-columns A4
paper with 2cm margins on each side and a column separa-
gs = fig.add_gridspec(3, 3) tion of 1cm. The width of a figure is (21 - 2*2 - 1)/2 = 8cm.
ax.fill_betweenx([-1,1],[0],[2*np.pi])
ax = fig.add_subplot(gs[0, :]) One inch being 2.54cm, figure size should be 3.15×3.15 in.
ax.text(0, -1, r” Period $\Phi$”)
ax.set_facecolor(”#ddddff”)
fig = plt.figure(figsize=(3.15,3.15), dpi=50)
1 plt.savefig(”figure.pdf”, dpi=600)
fig, ax = plt.subplots() 0
ax.spines[”top”].set_color(”None”) 1 Period Matplotlib 3.5.0 handout for intermediate users. Copyright (c) 2021 Matplotlib De-
velopment Team. Released under a CC-BY 4.0 International License. Supported by
ax.spines[”right”].set_color(”None”) 0 5 10 15 20 25 30 NumFOCUS.
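The column-width calculation above translates directly into code; a minimal sketch (the A4 margins and column separation are the handout's stated assumptions):

import matplotlib.pyplot as plt

width_cm = (21 - 2 * 2 - 1) / 2          # A4 width minus margins and column gap, halved: 8 cm
width_in = width_cm / 2.54               # about 3.15 inches
fig = plt.figure(figsize=(width_in, width_in), dpi=50)
fig.savefig("figure.pdf", dpi=600)       # save at a higher resolution than the on-screen dpi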
Matplotlib tips & tricks
Transparency Text outline Colorbar adjustment
Scatter plots can be enhanced by using transparency (al- Use text outline to make text more visible. You can adjust a colorbar’s size when adding it.
pha) in order to show area with higher density. Multiple scat-
import matplotlib.patheffects as fx im = ax.imshow(Z)
ter plots can be used to delineate a frontier. text = ax.text(0.5, 0.1, ”Label”)
text.set_path_effects([ cb = plt.colorbar(im,
X = np.random.normal(-1, 1, 500) fx.Stroke(linewidth=3, foreground=’1.0’), fraction=0.046, pad=0.04)
Y = np.random.normal(-1, 1, 500) fx.Normal()]) cb.set_ticks([])
ax.scatter(X, Y, 50, ”0.0”, lw=2) # optional
ax.scatter(X, Y, 50, ”1.0”, lw=0) # optional
ax.scatter(X, Y, 40, ”C1”, lw=0, alpha=0.1)
Do something Do something
Getting Help An integer
2:6 2 3 4 5 6
sequence } }
RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • mhairihmcneill@gmail.com Learn more at web page or vignette • package version • Updated: 3/15
Types - Converting between common data types in R. Can always go from a higher value in the table to a lower value:
as.logical TRUE, FALSE, TRUE - Boolean values (TRUE or FALSE).
as.numeric 1, 0, 1 - Numbers.
as.character - Character strings. Generally preferred to factors.
as.factor '1', '0', '1', levels: '1', '0' - Character strings with preset levels. Needed for some statistical models.
Matrices
m <- matrix(x, nrow = 3, ncol = 3) - Create a matrix from x.
m[2, ] - Select a row. m[ , 1] - Select a column.
t(m) - Transpose. m %*% n - Matrix multiplication.
Strings - Also see the stringr package.
paste(x, y, sep = ' ') - Join multiple vectors together.
paste(x, collapse = ' ') - Join elements of a vector together.
grep(pattern, x) - Find regular expression matches in x.
gsub(pattern, replace, x) - Replace matches in x with a string.
nchar(x) - Number of characters in a string.
w Lists Factors
l <- list(x = 1:5, y = c('a', 'b')) factor(x) cut(x, breaks = 4)
Maths Functions A list is a collection of elements which can be of different types. Turn a vector into a factor. Can
set the levels of the factor and
Turn a numeric vector into a
factor by ‘cutting’ into
log(x) Natural log. sum(x) Sum. l[[2]] l[1] l$x l['y'] the order. sections.
New list with New list with
exp(x) Exponential. mean(x) Mean. Second element Element named
only the first only element
max(x) Largest element. median(x) Median.
of l.
element.
x.
named y. Statistics
min(x) Smallest element. quantile(x) Percentage
lm(y ~ x, data=df) prop.test
Also see the t.test(x, y)
quantiles.
dplyr package. Data Frames Linear model. Perform a t-test for Test for a
round(x, n) Round to n decimal rank(x) Rank of elements. difference
difference between
places. glm(y ~ x, data=df) between
df <- data.frame(x = 1:3, y = c('a', 'b', 'c')) means.
Generalised linear model. proportions.
signif(x, n) Round to n var(x) The variance. A special case of a list where all elements are the same length.
significant figures. pairwise.t.test
List subsetting summary aov
Perform a t-test for
cor(x, y) Correlation. sd(x) The standard x y Get more detailed information Analysis of
paired data.
deviation. out a model. variance.
df$x df[[2]]
1 a
Variable Assignment Distributions
2 b Understanding a data frame
> a <- 'apple' Random Density Cumulative
Quantile
> a See the full data Variates Function Distribution
3 c View(df)
[1] 'apple' frame. Normal rnorm dnorm pnorm qnorm
See the first 6
Matrix subsetting head(df) Poisson rpois dpois ppois qpois
rows.
The Environment Binomial rbinom dbinom pbinom qbinom
df[ , 2]
ls() List all variables in the nrow(df) cbind - Bind columns. Uniform runif dunif punif qunif
environment. Number of rows.
columns.
rm(list = ls()) Remove all variables from the rbind - Bind rows. plot(x) plot(x, y) hist(x)
environment. Values of x in Values of x Histogram of
dim(df)
Number of order. against y. x.
You can use the environment panel in RStudio to
df[2, 2] columns and
browse variables in your environment. rows.
Dates See the lubridate package.
RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • mhairihmcneill@gmail.com • 844-448-1212 • rstudio.com Learn more at web page or vignette • package version • Updated: 3/15
Read functions Parsing data types
Data
Tidy Import
Data
with readr, tibble, and tidyr
Read tabular data to tibbles readr functions guess the types of each column
and convert types when appropriate (but will
with tidyr Cheat Sheet These functions share the common arguments: NOT convert strings to factors automatically).
Cheat Sheet read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"),
A message shows the type of each column in
quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max =
the result.
min(1000, n_max), progress = interactive())
## Parsed with column specification:
A B C read_csv() ## cols(
a,b,c age is an
R’s tidyverse is built around tidy data stored in 1 2 3 Reads comma delimited files. ## age = col_integer(),
tibbles, an enhanced version of a data frame. 1,2,3 4 5 NA read_csv("file.csv") ## sex = col_character(), integer
4,5,NA ## earn = col_double()
The front side of this sheet shows how ## ) sex is a
earn is a double (numeric) character
to read text files into R with readr. A B C read_csv2()
a;b;c 1. Use problems() to diagnose problems
The reverse side shows how to create
1 2 3 Reads Semi-colon delimited files.
1;2;3 x <- read_csv("file.csv"); problems(x)
tibbles with tibble and to layout tidy
4 5 NA read_csv2("file2.csv")
4;5;NA
data with tidyr. 2. Use a col_ function to guide parsing
A B C read_delim(delim, quote = "\"", escape_backslash = FALSE, • col_guess() - the default
Other types of data a|b|c escape_double = TRUE) Reads files with any delimiter.
1 2 3 • col_character()
Try one of the following packages to import 1|2|3 4 5 NA read_delim("file.txt", delim = "|") • col_double()
other types of files 4|5|NA • col_euro_double()
• haven - SPSS, Stata, and SAS files • col_datetime(format = "") Also
• readxl - excel files (.xls and .xlsx) A B C read_fwf(col_positions)
abc col_date(format = "") and col_time(format = "")
• DBI - databases
1 2 3 Reads fixed width files.
123 • col_factor(levels, ordered = FALSE)
• jsonlite - json
4 5 NA read_fwf("file.fwf", col_positions = c(1, 3, 5))
4 5 NA • col_integer()
• xml2 - XML read_tsv() • col_logical()
• httr - Web APIs Reads tab delimited files. Also read_table(). • col_number()
• rvest - HTML (Web Scraping) read_tsv("file.tsv") • col_numeric()
• col_skip()
x <- read_csv("file.csv", col_types = cols(
Write functions Useful arguments
A = col_double(),
a,b,c B = col_logical(),
Save x, an R object, to path, a file path, with: Example file 1 2 3 Skip lines
C = col_factor()
1,2,3 write_csv (path = "file.csv", read_csv("file.csv",
write_csv(x, path, na = "NA", append = FALSE, 4 5 NA ))
4,5,NA x = read_csv("a,b,c\n1,2,3\n4,5,NA")) skip = 1)
col_names = !append)
3. Else, read in as character vectors then parse
Tibble/df to comma delimited file. with a parse_ function.
A B C No header A B C
Read in a subset
write_delim(x, path, delim = " ", na = "NA", 1 2 3 1 2 3 • parse_guess(x, na = c("", "NA"), locale =
append = FALSE, col_names = !append) read_csv("file.csv", read_csv("file.csv",
4 5 NA default_locale())
col_names = FALSE) n_max = 1)
Tibble/df to file with any delimiter. • parse_character(x, na = c("", "NA"), locale =
A B C
write_excel_csv(x, path, na = "NA", append = x y z default_locale())
Provide header 1 2 3
FALSE, col_names = !append) A B C Missing Values • parse_datetime(x, format = "", na = c("", "NA"),
read_csv("file.csv", NA NA NA
locale = default_locale()) Also parse_date()
Tibble/df to a CSV for excel 1 2 3
col_names = c("x", "y", "z")) read_csv("file.csv",
and parse_time()
write_file(x, path, append = FALSE) 4 5 NA na = c("4", "5", "."))
• parse_double(x, na = c("", "NA"), locale =
String to file. default_locale())
Read non-tabular data
write_lines(x, path, na = "NA", append = • parse_factor(x, levels, ordered = FALSE, na =
FALSE) read_file(file, locale = default_locale())
read_lines_raw(file, skip = 0, n_max = -1L, c("", "NA"), locale = default_locale())
String vector to file, one element per line. Read a file into a single string. progress = interactive()) • parse_integer(x, na = c("", "NA"), locale =
write_rds(x, path, compress = c("none", "gz", read_file_raw(file) Read each line into a raw vector. default_locale())
"bz2", "xz"), ...) Read a file into a raw vector. • parse_logical(x, na = c("", "NA"), locale =
read_log(file, col_names = FALSE, col_types =
Object to RDS file. read_lines(file, skip = 0, n_max = -1L, locale = NULL, skip = 0, n_max = -1, progress = default_locale())
write_tsv(x, path, na = "NA", append = FALSE, default_locale(), na = character(), progress = interactive()) • parse_number(x, na = c("", "NA"), locale =
col_names = !append) interactive()) Apache style log files. default_locale())
Tibble/df to tab delimited files. Read each line into its own string. x$A <- parse_number(x$A)
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio info@rstudio.com • 844-448-1212 • rstudio.com Learn more at browseVignettes(package = c("readr", "tibble", "tidyr")) • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01
Tibbles - an enhanced data frame Tidy Data with tidyr
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.
The tibble package provides a new S3 class for
A table is tidy if: Tidy data: Split and Combine Cells
storing tabular data, the tibble. Tibbles inherit the A * B -> C
data frame class, but improve two behaviors: A B C A B C A B C A * B Use these functions to split or combine cells into
C
individual, isolated values.
• Display - When you print a tibble, R provides a
concise view of the data that fits on one screen. & separate(data, col, into, sep = "[^[:alnum:]]+",
• Subsetting - [ always returns a new tibble, remove = TRUE, convert = FALSE,
Each variable is in Each observation, or Makes variables easy Preserves cases during
[[ and $ always return a vector. extra = "warn", fill = "warn", ...)
its own column case, is in its own row to access as vectors vectorized operations
• No partial matching - You must use full Separate each cell in a column to make several
column names when subsetting Reshape Data - change the layout of values in a table columns.
table3
Use gather() and spread() to reorganize the values of a table into a new layout. Each uses the idea of a
# A tibble: 234 × 6
manufacturer model displ
<chr> <chr> <dbl> country year rate country year cases pop
1
2
audi
audi
a4
a4
1.8
1.8
key column: value column pair. A 1999 0.7K/19M A 1999 0.7K 19M
3 audi a4 2.0
gather(data, key, value, ..., na.rm = FALSE, spread(data, key, value, fill = NA, convert = FALSE,
4 audi a4 2.0 A 2000 2K/20M A 2000 2K 20M
5 audi a4 2.8
6 audi a4 2.8 B 1999 37K/172M B 1999 37K 172
7 audi a4 3.1
convert = FALSE, factor_key = FALSE) drop = TRUE, sep = NULL) B 2000 80K/174M B 2000 80K 174
w
w
8 audi a4 quattro 1.8
9 audi a4 quattro 1.8 C 1999 212K/1T C 1999 212K 1T
10 audi a4 quattro 2.0
# ... with 224 more rows, and 3
# more variables: year <int>,
Gather moves column names into a key Spread moves the unique values of a key column C 2000 213K/1T C 2000 213K 1T
# cyl <int>, trans <chr> column, gathering the column values into a into the column names, spreading the values of a
tibble display single value column. value column across the new columns that result. separate_rows(table3, rate,
156 1999 6 auto(l4) table4a table2
into = c("cases", "pop"))
157 1999 6 auto(l4)
158 2008 6 auto(l4)
159 2008 8 auto(s4)
country 1999 2000 country year cases country year type count country year cases pop
160 1999
161 1999
4 manual(m5)
4 auto(l4)
A 0.7K 2K A 1999 0.7K A 1999 cases 0.7K A 1999 0.7K 19M separate_rows(data, ..., sep = "[^[:alnum:].]+",
162 2008 4 manual(m5) B 37K 80K B 1999 37K A 1999 pop 19M A 2000 2K 20M
163 2008
164 2008
4 manual(m5)
4 auto(l4) C 212K 213K C 1999 212K A 2000 cases 2K B 1999 37K 172M convert = FALSE)
165 2008 4 auto(l4)
166 1999 4 auto(l4) A 2000 2K A 2000 pop 20M B 2000 80K 174M
A large table [ reached
-- omitted
getOption("max.print")
68 rows ] B 2000 80K B 1999 cases 37K C 1999 212K 1T Separate each cell in a column to make several
to display data frame display C 2000 213K B 1999 pop 172M C 2000 213K 1T rows. Also separate_rows_().
key value B 2000 cases 80K
• Control the default appearance with options: table3
B 2000 pop 174M
country year rate country year rate
options(tibble.print_max = n, C 1999 cases 212K
A 1999 0.7K/19M A 1999 0.7K
tibble.print_min = m, tibble.width = Inf) C 1999 pop 1T
A 2000 2K/20M A 1999 19M
C 2000 cases 213K
B 1999 37K/172M A 2000 2K
• View entire data set with View(x, title) or C 2000 pop 1T
B 2000 80K/174M A 2000 20M
gather(table4a, `1999`, `2000`, key value
glimpse(x, width = NULL, …) C 1999 212K/1T B 1999 37K
key = "year", value = "cases") spread(table2, type, count) C 2000 213K/1T B 1999 172M
• Revert to data frame with as.data.frame() B 2000 80K
(required for some older packages) B 2000 174M
Handle Missing Values C 1999 212K
Construct a tibble in two ways C 1999 1T
drop_na(data, ...) fill(data, ..., .direction = c("down", "up")) replace_na(data, C 2000 213K
tibble(…) replace = list(), ...) C 2000 1T
Drop rows containing Fill in NA’s in … columns with most
Construct by columns. Both make
NA’s in … columns. recent non-NA values. Replace NA’s by column. separate_rows(table3, rate)
tibble(x = 1:3, this tibble x x x
y = c("a", "b", "c")) x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2
A 1 A 1 A 1 A 1 A 1 A 1 unite(data, col, ..., sep = "_", remove = TRUE)
tribble(…) A tibble: 3 × 2 B NA D 3 B NA B 1 B NA B 2
Construct by rows. x y C
D
NA
3
C
D
NA
3
C
D
1
3
C
D
NA
3
C
D
2
3 Collapse cells across several columns to
tribble( <int> <dbl> E NA E NA E 3 E NA E 2 make a single column.
1 1 a
~x, ~y, 2 2 b table5
1, "a", 3 3 c drop_na(x, x2) fill(x, x2) replace_na(x,list(x2 = 2), x2)
country century year country year
2, "b", Afghan 19 99 Afghan 1999
3, "c") Expand Tables - quickly create tables with combinations of values Afghan 20 0 Afghan 2000
as_tibble(x, …) Convert data frame to tibble. Brazil 19 99 Brazil 1999
names column and a values column. values of the variables listed in … of the values of the variables listed in … unite(table5, century, year,
complete(mtcars, cyl, gear, carb) expand(mtcars, cyl, gear, carb)
is_tibble(x) Test whether x is a tibble. col = "year", sep = "")
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio info@rstudio.com • 844-448-1212 • rstudio.com Learn more at browseVignettes(package = c("readr", "tibble", "tidyr")) • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01
Data Wrangling Tidy Data - A foundation for wrangling in R
with dplyr and tidyr F MA F MA
Tidy data complements R’s vectorized M * A F
Syntax - Helpful conventions for wrangling Reshaping Data - Change the layout of a data set
dplyr::data_frame(a = 1:3, b = 4:6)
dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than w
ww
w w
w
ww
w
w
Combine vectors into data frame
(optimized).
data frames. R displays only the data that fits onscreen: ww
1005
A 1005
A
1013
A dplyr::arrange(mtcars, mpg)
1013
A
1010
A 1010
A
tidyr::spread(pollution, size, amount) Order rows by values of a column
1010
A
Source: local data frame [150 x 5]
+ =
A 1 A T
B 2 B F
C 3 D T
dplyr::summarise(iris, avg = mean(Sepal.Length))
dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width) Mutating Joins
Summarise data into single row of values.
Compute and append one or more new columns. x1 x2 x3
dplyr::left_join(a, b, by = "x1")
dplyr::summarise_each(iris, funs(mean)) A 1 T
dplyr::mutate_each(iris, funs(min_rank)) B 2 F
Join matching rows from b to a.
Apply summary function to each column. C 3 NA
Apply window function to each column.
dplyr::count(iris, Species, wt = Sepal.Length) x1 x3 x2
dplyr::right_join(a, b, by = "x1")
dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width) A T 1
Count number of rows with each unique value of B F 2 Join matching rows from a to b.
Compute one or more new columns. Drop original columns.
variable (with or without weights).
D T NA
x1 x2 x3 dplyr::inner_join(a, b, by = "x1")
A 1 T
summary window B 2 F Join data. Retain only rows in both sets.
function function x1
A
x2
1
x3
T
dplyr::full_join(a, b, by = "x1")
Summarise uses summary functions, functions that Mutate uses window functions, functions that take a vector of B
C
2
3
F
NA
Join data. Retain all values, all rows.
take a vector of values and return a single value, such as: values and return another vector of values, such as: D NA T
dplyr::nth mean
C 3
dplyr::dense_rank dplyr::cummean All rows in a that do not have a match in b.
Nth value of a vector. Mean value of a vector.
Ranks with no gaps. Cumulative mean y z
dplyr::n median
dplyr::min_rank cumsum x1 x2 x1 x2
# of values in a vector. Median value of a vector.
+ =
A 1 B 2
dplyr::n_distinct var Ranks. Ties get min rank. Cumulative sum B 2 C 3
Group data into rows with the same value of Species. Are values between a and b? Element-wise max x1 x2 dplyr::setdiff(y, z)
A 1
dplyr::ungroup(iris) dplyr::cume_dist pmin Rows that appear in y but not z.
Remove grouping information from data frame. Cumulative distribution. Element-wise min Binding
iris %>% group_by(Species) %>% summarise(…) iris %>% group_by(Species) %>% mutate(…)
x1
A
x2
1
Compute separate summary row for each group. Compute new variables by group.
B 2 dplyr::bind_rows(y, z)
C 3
B
C
2
3
Append z to y as new rows.
D 4
ir ir dplyr::bind_cols(y, z)
C x1 x2 x1 x2
A 1 B 2 Append z to y as new columns.
B 2 C 3
C 3 D 4 Caution: matches rows by position.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15
R For Data Science Cheat Sheet General form: DT[i, j, by] Advanced Data Table Operations
> DT[.N-1]
data.table “Take DT, subset rows using i, then calculate j grouped by by” > DT[,.N]
Return the penultimate row of the DT
Return the number of rows
> DT[,.(V2,V3)] Return V2 and V3 as a data.table
Learn R for data science Interactively at www.DataCamp.com
Adding/Updating Columns By Reference in j Using := >
>
DT[,list(V2,V3)]
DT[,mean(V3),by=.(V1,V2)]
Return V2 and V3 as a data.table
Return the result of j, grouped by all possible
> DT[,V1:=round(exp(V1),2)] V1 is updated by what is after := V1 V2 V1 combinations of groups specified in by
> DT Return the result by calling DT 1: 1 A 0.4053
2: 1 B 0.4053
V1 V2 V3 V4
data.table 1: 2.72 A -0.1107 1
2: 7.39 B -0.1427 2
3:
4:
1 C 0.4053
2 A -0.6443
5: 2 B -0.6443
data.table is an R package that provides a high-performance 3: 2.72 C -1.8893 3 6: 2 C -0.6443
4: 7.39 A -0.3571 4
version of base R’s data.frame with syntax and feature ... .SD & .SDcols
enhancements for ease of use, convenience and > DT[,c("V1","V2"):=list(round(exp(V1),2), Columns V1 and V2 are updated by
> DT[,print(.SD),by=V2] Look at what .SD contains
LETTERS[4:6])] what is after :=
programming speed. > DT[,':='(V1=round(exp(V1),2), Alternative to the above one. With [], > DT[,.SD[c(1,.N)],by=V2] Select the first and last row grouped by V2
V2=LETTERS[4:6])][] you print the result to the screen > DT[,lapply(.SD,sum),by=V2] Calculate sum of columns in .SD grouped by
Load the package: V1 V2 V3 V4
V2
1: 15.18 D -0.1107 1 > DT[,lapply(.SD,sum),by=V2, Calculate sum of V3 and V4 in .SD grouped by
> library(data.table) .SDcols=c("V3","V4")] V2
2: 1619.71 E -0.1427 2
V2 V3 V4
Creating A data.table
3: 15.18 F -1.8893 3
1: A -0.478 22
4: 1619.71 D -0.3571 4 2: B -0.478 26
> DT[3:5,] Select 3rd to 5th row Indexing And Keys V1 V4.Sum
> DT[3:5] Select 3rd to 5th row 1: 1 36
> DT[V2=="A"] Select all rows that have value A in column V2 > setkey(DT,V2) A key is set on V2; output is returned invisibly 2: 2 42
> DT[V2 %in% c("A","C")] Select all rows that have value A or C in column V2 > DT["A"] Return all rows where the key column (set to V2) has > DT[V4.Sum>40] Select that group of which the sum is >40
V1 V2 V3 V4 the value A > DT[,.(V4.Sum=sum(V4)), Select that group of which the sum is >40
Manipulating on Columns in j by=V1][V4.Sum>40] (chaining)
1: 1 A -0.2392 1
2: 2 A -1.6148 4 V1 V4.Sum
3: 1 A 1.0498 7 1: 2 42
> DT[,V2] Return V2 as a vector 4: 2 A 0.3262 10 > DT[,.(V4.Sum=sum(V4)), Calculate sum of V4, grouped by V1,
[1] “A” “B” “C” “A” “B” “C” ... > DT[c("A","C")] Return all rows where the key column (V2) has value A or C by=V1][order(-V1)] ordered on V1
> DT[,.(V2,V3)] Return V2 and V3 as a data.table > DT["A",mult="first"] Return first row of all rows that match value A in key V1 V4.Sum
> DT[,sum(V1)] Return the sum of all elements of V1 in a column V2 1: 2 42
[1] 18 vector > DT["A",mult="last"] Return last row of all rows that match value A in key 2: 1 36
> DT[,.(sum(V1),sd(V3))] Return the sum of all elements of V1 and the column V2
V1 V2 std. dev. of V3 in a data.table > DT[c("A","D")] Return all rows where key column V2 has value A or D
1: 18 0.4546055
> DT[,.(Aggregate=sum(V1), The same as the above, with new names
V1 V2 V3 V4
1: 1 A -0.2392 1 set()-Family
Sd.V3=sd(V3))] 2: 2 A -1.6148 4
Aggregate Sd.V3 3: 1 A 1.0498 7 set()
1: 18 0.4546055 4: 2 A 0.3262 10
> DT[,.(V1,Sd.V3=sd(V3))] Select column V2 and compute std. dev. of V3, 5: NA D NA NA Syntax: for (i in from:to) set(DT, row, column, new value)
which returns a single value and gets recycled > DT[c("A","D"),nomatch=0] Return all rows where key column V2 has value A or D > rows <- list(3:4,5:6)
V1 V2 V3 V4
> DT[,.(print(V2), Print column V2 and plot V3 > cols <- 1:2
1: 1 A -0.2392 1
plot(V3), > for(i in seq_along(rows)) Sequence along the values of rows, and
2: 2 A -1.6148 4
NULL)] {set(DT, for the values of cols, set the values of
3: 1 A 1.0498 7
4: 2 A 0.3262 10 i=rows[[i]], those elements equal to NA (invisible)
j=cols[i],
Doing j by Group > DT[c("A","C"),sum(V4)] Return total sum of V4, for rows of key column V2 that
have values A or C value=NA)}
> DT[,.(V4.Sum=sum(V4)),by=V1] Calculate sum of V4 for every group in V1
V1 V4.Sum
> DT[c("A","C"),
sum(V4),
Return sum of column V4 for rows of V2 that have value A,
and another sum for rows of V2 that have value C setnames()
1: 1 36 by=.EACHI] Syntax: setnames(DT,"old","new")[]
2: 2 42 V2 V1
1: A 22 > setnames(DT,"V2","Rating") Set name of V2 to Rating (invisible)
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 Change 2 column names (invisible)
by=.(V1,V2)] and V2 2: C 30 > setnames(DT,
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in > setkey(DT,V1,V2) Sort by V1 and then by V2 within each group of V1 (invisible) c("V2","V3"),
by=sign(V1-1)] sign(V1-1) > DT[.(2,"C")] Select rows that have value 2 for the first key (V1) and the c("V2.rating","V3.DC"))
value C for the second key (V2)
setcolorder()
V1 V2 V3 V4
sign V4.Sum
1: 0 36 1: 2 C 0.3262 6
2: 1 42 2: 2 C -1.6148 12
Syntax: setcolorder(DT,"neworder")
> DT[,.(V4.Sum=sum(V4)), The same as the above, with new name > DT[.(2,c("A","C"))] Select rows that have value 2 for the first key (V1) and within
V1 V2 V3 V4 those rows the value A or C for the second key (V2) > setcolorder(DT, Change column ordering to contents
by=.(V1.01=sign(V1-1))] for the variable you’re grouping by
> DT[1:5,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 1: 2 A -1.6148 4 c("V2","V1","V4","V3")) of the specified vector (invisible)
2: 2 A 0.3262 10
by=V1] after subsetting on the first 5 rows
3: 2 C 0.3262 6
> DT[,.N,by=V1] Count number of rows for every group in
4: 2 C -1.6148 12
DataCamp
V1 Learn Python for Data Science Interactively
Data Transformation with data.table : : CHEAT SHEET
Basics Manipulate columns with j Group according to by
data.table is an extremely fast and memory efficient package
for transforming data in R. It works by converting R’s native a a a dt[, j, by = .(a)] – group rows by
EXTRACT
data frame objects into data.tables with new and enhanced values in specified columns.
functionality. The basics of working with data.tables are: dt[, c(2)] – extract columns by number. Prefix
column numbers with “-” to drop. dt[, j, keyby = .(a)] – group and
dt[i, j, by] simultaneously sort rows by values
in specified columns.
Take data.table dt, b c b c dt[, .(b, c)] – extract columns by name.
subset rows using i COMMON GROUPED OPERATIONS
and manipulate columns with j,
grouped according to by. dt[, .(c = sum(b)), by = a] – summarize rows within groups.
data.tables are also data frames – functions that work with data SUMMARIZE dt[, c := sum(b), by = a] – create a new column and compute rows
frames therefore also work with data.tables. within groups.
a x dt[, .(x = sum(a))] – create a data.table with new
columns based on the summarized values of rows.
dt[, .SD[1], by = a] – extract first row of groups.
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique
BIND
Apply function to cols.
a b a b a b a b a b rbind(dt_a, dt_b) – combine rows of two
1 2 1 2 rows based on columns specified in “by”. + = data.tables.
2 2 2 2 Leave out “by” to use all columns. APPLY A FUNCTION TO MULTIPLE COLUMNS
1 2
a b a b dt[, lapply(.SD, mean), .SDcols = c("a", "b")] –
uniqueN(dt, by = c("a", "b")) – count the number of unique rows
1 4 2 5 apply a function – e.g. mean(), as.character(),
based on columns specified in “by”. a b x y a b x y cbind(dt_a, dt_b) – combine columns
2 5 which.max() – to columns specified in .SDcols
of two data.tables.
3 6 with lapply() and the .SD symbol. Also works
+ = with groups.
RENAME COLUMNS
a a a_m cols <- c("a")
a b x y setnames(dt, c("a", "b"), c("x", "y")) – rename 1 1 2 dt[, paste0(cols, "_m") := lapply(.SD, mean),
columns. .SDcols = cols] – apply a function to specified
Reshape a data.table
2 2 2
3 3 2 columns and assign the result with suffixed
variable names to the original data.
SET KEYS RESHAPE TO WIDE FORMAT
setkey(dt, a, b) – set keys to enable fast repeated lookup in
specified columns using “dt[.(value), ]” or for merging without id y a b id a_x a_z b_x b_z dcast(dt, Sequential rows
specifying merging columns using “dt_a[dt_b]”. A x 1 3 A 1 2 3 4 id ~ y,
A z 2 4 B 1 2 3 4
B x 1 3
value.var = c("a", "b")) ROW IDS
B z 2 4
dt[, c := 1:.N, by = b] – within groups, compute a
Combine data.tables
a b a b c
Reshape a data.table from long to wide format. 1 a 1 a 1 column with sequential row IDs.
2 a 2 a 2
dt A data.table. 3 b 3 b 1
JOIN id ~ y Formula with a LHS: ID columns containing IDs for
multiple entries. And a RHS: columns with values to
LAG & LEAD
a b x y a b x dt_a[dt_b, on = .(b = y)] – join spread in column headers.
1 c 3 b 3 b 3 data.tables on rows with equal values. value.var Columns containing values to fill into cells. dt[, c := shift(a, 1), by = b] – within groups,
2 a + 2 c = 1 c 2
a
1
b
a
a
1
b
a
c
NA duplicate a column with rows lagged by
3 b 1 a 2 a 1 2 a 2 a 1 specified amount.
RESHAPE TO LONG FORMAT 3 b 3 b NA
4 b 4 b 3
a b c x y z a b c x dt_a[dt_b, on = .(b = y, c > z)] – id a_x a_z b_x b_z id y a b melt(dt, 5 b 5 b 4 dt[, c := shift(a, 1, type = "lead"), by = b] –
1 c 7 3 b 4 3 b 4 3 join data.tables on rows with within groups, duplicate a column with rows
+ = id.vars = c("id"),
A 1 2 3 4 A 1 1 3
2 a 5 2 c 5 1 c 5 2 equal and unequal values. B 1 2 3 4 B 1 1 3 leading by specified amount.
3 b 6 1 a 8 NA a 8 1 A 2 2 4 measure.vars = patterns("^a", "^b"),
B 2 2 4 variable.name = "y",
value.name = c("a", "b"))
ROLLING JOIN read & write files
a id date b id date a id date b
Reshape a data.table from wide to long format.
1 A 01-01-2010 + 1 A 01-01-2013 = 2 A 01-01-2013 1 dt A data.table. IMPORT
2 A 01-01-2012 1 B 01-01-2013 2 B 01-01-2013 1 id.vars ID columns with IDs for multiple entries.
3 A 01-01-2014 measure.vars Columns containing values to fill into cells (often in fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
1 B 01-01-2010
pattern form).
2 B 01-01-2012
variable.name, Names of new columns for variables and values fread("file.csv", select = c("a", "b")) – read specified columns from a
value.name derived from old headers. flat file into R.
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join
data.tables on matching rows in id columns but only keep the most
recent preceding match with the left data.table according to date
columns. “roll = -Inf” reverses direction. EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
Machine Learning Modelling in R : : CHEAT SHEET
Supervised & Unsupervised Learning Meta-Algorithm, Time Series & Model Validation
Introduction • "center"
method getParamSet(learner=)
trafo
• "scale" lower=-2,upper=2,trafo=function(x) 10^x
• "standardize" "classif.qda"
• "range" range=c(0,1) Logical
LogicalVector CharacterVector DiscreteVector
mergeSmallFactorLevels(task=,cols=,min.perc=)
train(learner=,task=) makeTuneControl<type>()
WrappedModel • Grid(resolution=10L)
summarizeColumns(obj=) obj • Random(maxit=100)
• MBO(budget=)
• Irace(n.instances=)
capLargeValues dropFeatures getLearnerModel() • CMAES Design GenSA
removeConstantFeatures summarizeLevels
predict(object=,task=,newdata=) tuneParams(learner=,task=,resampling=,
measures=,par.set=,control=)
pred
View(pred)
makeClassifTask(data=,target=)
as.data.frame(pred)
A B C
positive Quickstart
makeRegrTask(data=,target=)
0 63 100 performance(pred=,measures=)
listMeasures() library(mlbench)
makeMultilabelTask(data=,target=)
• acc auc bac ber brier[.scaled] f1 fdr fn data(Soybean)
A B C fnr fp fpr gmean multiclass[.au1u .aunp .aunu soy = createDummyFeatures(Soybean,target="Class")
.brier] npv ppv qsr ssr tn tnr tp tpr wkappa tsk = makeClassifTask(data=soy,target="Class")
makeClusterTask(data=) • arsq expvar kendalltau mae mape medae ho = makeResampleInstance("Holdout",tsk)
medse mse msle rae rmse rmsle rrse rsq sae tsk.train = subsetTask(tsk,ho$train.inds[[1]])
spearmanrho sse tsk.test = subsetTask(tsk,ho$test.inds[[1]])
makeSurvTask(data=,target= • db dunn G1 G2 silhouette
c("time","event")) • multilabel[.f1 .subset01 .tpr .ppv
.acc .hamloss]
• mcp meancosts
• cindex
makeCostSensTask(data=,costs=) • featperc timeboth timepredict timetrain
A B lrn = makeLearner("classif.xgboost",nrounds=10)
cv = makeResampleDesc("CV",iters=5)
• calculateConfusionMatrix(pred=) res = resample(lrn,tsk.train,cv,acc)
task • calculateROCMeasures(pred=)
• weights=
• blocking=
makeResampleDesc(method=,...,stratify=)
method ps = makeParamSet(makeNumericParam("eta",0,1),
• "CV" iters= makeNumericParam("lambda",0,200),
makeLearner(cl=,predict.type=,...,par.vals=) • "LOO" iters= makeIntegerParam("max_depth",1,20))
• "RepCV" tc = makeTuneControlMBO(budget=100)
reps= folds= tr = tuneParams(lrn,tsk.train,cv5,acc,ps,tc)
• cl= "classif.xgboost" • "Subsample" lrn = setHyperPars(lrn,par.vals=tr$x)
"regr.randomForest" "cluster.kmeans" iters= split= eta lambda max_depth
• predict.type="response" • "Bootstrap" iters=
"prob" • "Holdout" split=
"se" stratify
• View(listLearners()) cv2
• View(listLearners(task)) cv3 cv5 cv10 hout
• View(listLearners("classif",
properties=c("prob", "factors"))) resample()
"classif" crossval() repcv() holdout() subsample()
"prob" "factors" bootstrapOOB() bootstrapB632() bootstrapB632plus()
• getLearnerProperties()
function(required_parameters=,optional_parameters=)
Configuration Feature Extraction Visualization Wrappers
configureMlr()
• show.info Wrapper 3, etc.
TRUE filterFeatures(task=,method=, generateThreshVsPerfData(obj=,measures=)
• on.learner.error "stop" perc=,abs=,threshold=) Wrapper 2
"warn" Wrapper 1
"quiet" "stop" • plotThreshVsPerf(obj)
• on.learner.warning Learner
"warn" "quiet" "warn" perc= abs= ThreshVsPerfData
• on.par.without.desc threshold= • plotROCCurves(obj)
"stop" "warn" "quiet" "stop"
• on.par.out.of.bounds method ThreshVsPerfData
"stop" "warn" "quiet" "stop" "randomForestSRC.rfsrc" measures=list(fpr,tpr) makeDummyFeaturesWrapper(learner=)
• on.measure.not.applicable "anova.test" "carscore" "cforest.importance" makeImputeWrapper(learner=,classes=,cols=)
"stop" "warn" "quiet" "stop" "chi.squared" "gain.ratio" "information.gain" makePreprocWrapper(learner=,train=,predict=)
• show.learner.output "kruskal.test" "linear.correlation" "mrmr" "oneR" makePreprocWrapperCaret(learner=,...)
TRUE "permutation.importance" "randomForest.importance" • plotResiduals(obj=) makeRemoveConstantFeaturesWrapper(learner=)
• on.error.dump "randomForestSRC.rfsrc" "randomForestSRC.var.select" Prediction BenchmarkResult
on.learner.error "stop" TRUE "rank.correlation" "relief"
"symmetrical.uncertainty" "univariate.model.score"
getMlrOptions() "variance" makeOverBaggingWrapper(learner=)
generateLearningCurveData(learners=,task=, makeSMOTEWrapper(learner=)
resampling=,percs=,measures=) makeUndersampleWrapper(learner=)
makeWeightedClassesWrapper(learner=)
Parallelization selectFeatures(learner=,task=
resampling=,measures=,control=)
• plotLearningCurve(obj=)
LearningCurveData
control makeCostSensClassifWrapper(learner=)
parallelMap
makeCostSensRegrWrapper(learner=)
makeCostSensWeightedPairsWrapper(learner=)
generateFilterValuesData(task=,method=)
• makeFeatSelControlExhaustive(max.features=)
parallelStart(mode=,cpus=,level=) max.features • plotFilterValues(obj=)
• makeMultilabelBinaryRelevanceWrapper(learner=)
• mode makeFeatSelControlRandom(maxit=,prob=,
makeMultilabelClassifierChainsWrapper(learner=)
• "local" mapply max.features=) FilterValuesData
makeMultilabelDBRWrapper(learner=)
• "multicore" prob maxit
makeMultilabelNestedStackingWrapper(learner=)
parallel::mclapply
• makeMultilabelStackingWrapper(learner=)
• "socket" makeFeatSelControlSequential(method=,maxit=,
• "mpi" max.features=,alpha=,beta=) generateHyperParsEffectData(tune.result=)
parallel::makeCluster parallel::clusterMap method "sfs"
• "BatchJobs" "sbs" "sffs"
"sfbs" alpha • plotHyperParsEffect(hyperpars.effec makeBaggingWrapper(learner=)
BatchJobs::batchMap
makeConstantClassWrapper(learner=)
• cpus beta t.data=,x=,y=,z=)
makeDownsampleWrapper(learner=,dw.perc=)
• level "mlr.benchmark"
• makeFeatSelControlGA(maxit=,max.features=,mu=, HyperParsEffectData makeFeatSelWrapper(learner=,resampling=,control=)
"mlr.resample" "mlr.selectFeatures"
lambda=,crossover.rate=,mutation.rate=) makeFilterWrapper(learner=,fw.perc=,fw.abs=,
"mlr.tuneParams" "mlr.ensemble"
• plotOptPath(op=) fw.threshold=)
<obj>$opt.path <obj> makeMultiClassWrapper(learner=)
parallelStop()
mu tuneResult featSelResult makeTuneWrapper(learner=,resampling=,par.set=,
lambda crossover.rate • plotTuneMultiCritResult(res=) control=)
Imputation mutation.rate
impute(obj=,target=,cols=,dummy.cols=,dummy.type=)
selectFeatures FeatSelResult
generatePartialDependenceData(obj=,input=) Nested Resampling
fsr tsk obj
tsk = subsetTask(tsk,features=fsr$x) input
• obj= • plotPartialDependence(obj=)
• target=
• cols= PartialDependenceData
• dummy.cols=
• dummy.type=
classes
"numeric"
dummy.classes cols
Benchmarking • resample benchmark
• makeTuneWrapper
benchmark(learners=,tasks=,resamplings=,measures=) • plotBMRBoxplots(bmr=) makeFeatSelWrapper
cols classes • plotBMRSummary(bmr=)
cols=list(V1=imputeMean()) V1 • plotBMRRanksAsBarChart(bmr=)
imputeMean()
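The mlr Quickstart code shown above is split across columns by the page layout; the following is the same workflow consolidated into one runnable sketch (the mlr, mlbench, xgboost and mlrMBO packages are assumed to be installed):

library(mlr)
library(mlbench)

data(Soybean)
soy <- createDummyFeatures(Soybean, target = "Class")
tsk <- makeClassifTask(data = soy, target = "Class")

ho        <- makeResampleInstance("Holdout", tsk)
tsk.train <- subsetTask(tsk, ho$train.inds[[1]])
tsk.test  <- subsetTask(tsk, ho$test.inds[[1]])

lrn <- makeLearner("classif.xgboost", nrounds = 10)
cv  <- makeResampleDesc("CV", iters = 5)
res <- resample(lrn, tsk.train, cv, acc)          # cross-validated accuracy

# Tune eta, lambda and max_depth with model-based optimisation
ps <- makeParamSet(makeNumericParam("eta", 0, 1),
                   makeNumericParam("lambda", 0, 200),
                   makeIntegerParam("max_depth", 1, 20))
tc <- makeTuneControlMBO(budget = 100)
tr <- tuneParams(lrn, tsk.train, cv5, acc, ps, tc)
lrn <- setHyperPars(lrn, par.vals = tr$x)         # refit learner with tuned values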
Autoregression of Order p
X_t = ϕ_1 X_{t-1} + ϕ_2 X_{t-2} + … + ϕ_p X_{t-p} + W_t
Use the ACF and PACF to detect the model order.

Moving Average of Order q
X_t = Z_t + θ_1 Z_{t-1} + θ_2 Z_{t-2} + … + θ_q Z_{t-q}

ARMA(p, q)
X_t = ϕ_1 X_{t-1} + ϕ_2 X_{t-2} + … + ϕ_p X_{t-p} + Z_t + θ_1 Z_{t-1} + θ_2 Z_{t-2} + … + θ_q Z_{t-q}

(Complete) auto-correlation function: acf()
acf(data, type = 'correlation', na.action = na.pass)

arima(): to estimate the parameters of an AR, MA or ARMA model and build the model
arima(data, order = c(p, 0, q), method = c('ML'))

Forecast: autoplot(forecast(arima_model, level = c(95), h = number_to_predict))
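A brief sketch tying these commands together on simulated data (the forecast and ggplot2 packages are assumed to be installed; the series length and model orders are illustrative only):

library(forecast)   # forecast() and autoplot() methods for Arima objects
library(ggplot2)

set.seed(1)
x <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 200)   # simulated ARMA(1, 1) series

acf(x, type = "correlation", na.action = na.pass)   # inspect the autocorrelation
pacf(x)                                              # and partial autocorrelation

arma_model <- arima(x, order = c(1, 0, 1), method = "ML")   # estimate phi and theta
autoplot(forecast(arma_model, level = c(95), h = 10))       # 10-step-ahead forecast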
RStudio® is a trademark of RStudio, Inc. • CC BY SA Yunjun Xia, Shuyu Huang • yx2569@columbia.edu, sh3967@columbia.edu • Updated: 2019-10
Deep Learning with Keras : : CHEAT SHEET

Intro
Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. It supports multiple back-ends, including TensorFlow, CNTK and Theano. TensorFlow is a lower-level mathematical library for building deep neural network architectures. The keras R package makes it easy to use Keras and TensorFlow in R. Learn more at https://keras.rstudio.com and https://www.manning.com/books/deep-learning-with-r

INSTALLATION
The keras R package uses the Python Keras library. You can install all the prerequisites directly from R:
library(keras)
install_keras()   # see ?install_keras for GPU instructions
This installs the required libraries in an Anaconda environment or virtual environment 'r-tensorflow'. See https://keras.rstudio.com/reference/install_keras.html

THE "HELLO, WORLD!" OF DEEP LEARNING
Define (Sequential model, Multi-GPU model) → Compile (optimiser, loss, metrics) → Fit (batch size, epochs, validation split) → Evaluate (evaluate, plot) → Predict (classes, probability)
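A minimal, hypothetical sketch of that Define → Compile → Fit → Evaluate → Predict loop on random data (layer sizes, optimiser and epoch counts are illustrative only):

library(keras)

x <- matrix(runif(100 * 10), ncol = 10)   # 100 fake samples, 10 features
y <- rbinom(100, 1, 0.5)                  # fake binary labels

model <- keras_model_sequential() %>%                                   # Define
  layer_dense(units = 32, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "rmsprop",                                # Compile
                  loss = "binary_crossentropy",
                  metrics = "accuracy")

history <- model %>% fit(x, y, epochs = 5, batch_size = 16,             # Fit
                         validation_split = 0.2)

model %>% evaluate(x, y)                                                # Evaluate
predict(model, x)                                                       # Predict probabilities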
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at keras.rstudio.com • keras 2.1.2 • Updated: 2017-12
More layers Preprocessing
CONVOLUTIONAL LAYERS ACTIVATION LAYERS SEQUENCE PREPROCESSING Keras TensorFlow
layer_conv_1d() 1D, e.g.
temporal convolution
layer_activation(object, activation)
Apply an activation function to an output
pad_sequences()
Pads each sequence to the same length (length of Pre-trained models
the longest sequence)
layer_activation_leaky_relu() Keras applications are deep learning models
layer_conv_2d_transpose() Leaky version of a rectified linear unit skipgrams() that are made available alongside pre-trained
Transposed 2D (deconvolution) Generates skipgram word pairs weights. These models can be used for
α layer_activation_parametric_relu() prediction, feature extraction, and fine-tuning.
layer_conv_2d() 2D, e.g. spatial Parametric rectified linear unit make_sampling_table() application_xception()
convolution over images Generates word rank-based probabilistic sampling xception_preprocess_input()
layer_activation_thresholded_relu() table Xception v1 model
Thresholded rectified linear unit
layer_conv_3d_transpose()
Transposed 3D (deconvolution) layer_activation_elu() TEXT PREPROCESSING application_inception_v3()
layer_conv_3d() 3D, e.g. spatial Exponential linear unit inception_v3_preprocess_input()
text_tokenizer() Text tokenization utility Inception v3 model, with weights pre-trained
convolution over volumes
on ImageNet
fit_text_tokenizer() Update tokenizer internal
layer_conv_lstm_2d() vocabulary
Convolutional LSTM DROPOUT LAYERS application_inception_resnet_v2()
save_text_tokenizer(); load_text_tokenizer() inception_resnet_v2_preprocess_input()
layer_separable_conv_2d() layer_dropout() Inception-ResNet v2 model, with weights
Depthwise separable 2D Save a text tokenizer to an external file
Applies dropout to the input trained on ImageNet
layer_upsampling_1d() texts_to_sequences();
layer_spatial_dropout_1d() texts_to_sequences_generator() application_vgg16(); application_vgg19()
layer_upsampling_2d() layer_spatial_dropout_2d()
layer_upsampling_3d() Transforms each text in texts to sequence of integers VGG16 and VGG19 models
layer_spatial_dropout_3d()
Upsampling layer Spatial 1D to 3D version of dropout texts_to_matrix(); sequences_to_matrix() application_resnet50() ResNet50 model
layer_zero_padding_1d() Convert a list of sequences into a matrix
layer_zero_padding_2d() application_mobilenet()
layer_zero_padding_3d() RECURRENT LAYERS text_one_hot() One-hot encode text to word indices mobilenet_preprocess_input()
Zero-padding layer mobilenet_decode_predictions()
layer_simple_rnn() text_hashing_trick()
Fully-connected RNN where the output mobilenet_load_model_hdf5()
layer_cropping_1d() Converts a text to a sequence of indexes in a fixed-
layer_cropping_2d() is to be fed back to input MobileNet model architecture
size hashing space
layer_cropping_3d()
Cropping layer layer_gru() text_to_word_sequence()
Gated recurrent unit - Cho et al Convert text to a sequence of words (or tokens) ImageNet is a large database of images with
POOLING LAYERS
layer_cudnn_gru() labels, extensively used for deep learning
layer_max_pooling_1d() Fast GRU implementation backed IMAGE PREPROCESSING
layer_max_pooling_2d() by CuDNN imagenet_preprocess_input()
layer_max_pooling_3d() image_load() Loads an image into PIL format. imagenet_decode_predictions()
Maximum pooling for 1D to 3D layer_lstm() Preprocesses a tensor encoding a batch of
Long-Short Term Memory unit - flow_images_from_data() images for ImageNet, and decodes predictions
layer_average_pooling_1d() Hochreiter 1997 flow_images_from_directory()
layer_average_pooling_2d()
layer_average_pooling_3d()
Average pooling for 1D to 3D
layer_cudnn_lstm()
Fast LSTM implementation backed
Generates batches of augmented/normalized data
from images and labels, or a directory Callbacks
by CuDNN A callback is a set of functions to be applied at
layer_global_max_pooling_1d() image_data_generator() Generate minibatches of
image data with real-time data augmentation. given stages of the training procedure. You can
layer_global_max_pooling_2d() use callbacks to get a view on internal states
LOCALLY CONNECTED LAYERS
layer_global_max_pooling_3d() and statistics of the model during training.
Global maximum pooling fit_image_data_generator() Fit image data
layer_locally_connected_1d() generator internal statistics to some sample data callback_early_stopping() Stop training when
layer_global_average_pooling_1d() layer_locally_connected_2d() a monitored quantity has stopped improving
layer_global_average_pooling_2d() Similar to convolution, but weights are not generator_next() Retrieve the next item callback_learning_rate_scheduler() Learning
layer_global_average_pooling_3d() shared, i.e. different filters for each patch rate scheduler
Global average pooling image_to_array(); image_array_resize()
callback_tensorboard() TensorBoard basic
image_array_save() 3D array representation visualizations
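To make the sequence and text preprocessing helpers listed above concrete, a short hedged sketch (the toy sentences and vocabulary size are made up):

library(keras)

texts <- c("deep learning with keras", "keras runs on tensorflow")

tok <- text_tokenizer(num_words = 100) %>%   # build a vocabulary of at most 100 words
  fit_text_tokenizer(texts)

seqs   <- texts_to_sequences(tok, texts)     # words -> integer indices
padded <- pad_sequences(seqs, maxlen = 6)    # pad every sequence to the same length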
Need to Know

Regular expressions, or regexps, are a concise language for describing patterns in strings.

Pattern arguments in stringr are interpreted as regular expressions after any special characters have been parsed.

In R, you write regular expressions as strings: sequences of characters surrounded by double quotes ("") or single quotes (''). Some characters cannot be represented directly in an R string. These must be represented as special characters, sequences of characters that have a specific meaning, e.g.:

Special Character   Represents
\\                  \
\"                  "
\n                  new line
Run ?"'" to see a complete list.

Because of this, whenever a \ appears in a regular expression, you must write it as \\ in the string that represents the regular expression. Use writeLines() to see how R views your string after all special characters have been parsed:

writeLines("\\.")
# \.
writeLines("\\ is a backslash")
# \ is a backslash

MATCH CHARACTERS    see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx)

string (type this)   regexp (to mean this)   matches (which matches this)                  example
a (etc.)             a (etc.)                a (etc.)                                      see("a")
\\.                  \.                      .                                             see("\\.")
\\!                  \!                      !                                             see("\\!")
\\?                  \?                      ?                                             see("\\?")
\\\\                 \\                      \                                             see("\\\\")
\\(                  \(                      (                                             see("\\(")
\\)                  \)                      )                                             see("\\)")
\\{                  \{                      {                                             see("\\{")
\\}                  \}                      }                                             see("\\}")
\\n                  \n                      new line (return)                             see("\\n")
\\t                  \t                      tab                                           see("\\t")
\\s                  \s                      any whitespace (\S for non-whitespace)        see("\\s")
\\d                  \d                      any digit (\D for non-digits)                 see("\\d")
\\w                  \w                      any word character (\W for non-word chars)    see("\\w")
\\b                  \b                      word boundaries                               see("\\b")
[:digit:]¹                                   digits                                        see("[:digit:]")
[:alpha:]¹                                   letters                                       see("[:alpha:]")
[:lower:]¹                                   lowercase letters                             see("[:lower:]")
[:upper:]¹                                   uppercase letters                             see("[:upper:]")
[:alnum:]¹                                   letters and numbers                           see("[:alnum:]")
[:punct:]¹                                   punctuation                                   see("[:punct:]")
[:graph:]¹                                   letters, numbers, and punctuation             see("[:graph:]")
[:space:]¹                                   space characters (i.e. \s), including new line   see("[:space:]")
[:blank:]¹                                   space and tab (but not new line)              see("[:blank:]")
.                                            every character except a new line             see(".")

¹ Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]

INTERPRETATION
Patterns in stringr are interpreted as regexs. To
change this default, wrap the pattern in one of:
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...)
Modifies a regex to ignore cases, match end of lines as well as end of strings, allow R comments within regexs, and/or to have . match everything including \n.
str_detect("I", regex("i", TRUE))

fixed() Matches raw bytes but will miss some characters that can be represented in multiple ways (fast).
str_detect("\u0130", fixed("i"))

coll() Matches raw bytes and will use locale specific collation rules to recognize characters that can be represented in multiple ways (slow).
str_detect("\u0130", coll("i", TRUE, locale = "tr"))

boundary() Matches boundaries between characters, line_breaks, sentences, or words.
str_split(sentences, boundary("word"))

ALTERNATES    alt <- function(rx) str_view_all("abcde", rx)
regexp    matches        example
ab|d      or             alt("ab|d")
[abe]     one of         alt("[abe]")
[^abe]    anything but   alt("[^abe]")
[a-c]     range          alt("[a-c]")

ANCHORS    anchor <- function(rx) str_view_all("aaa", rx)
regexp    matches           example
^a        start of string   anchor("^a")
a$        end of string     anchor("a$")

QUANTIFIERS    quant <- function(rx) str_view_all(".a.aa.aaa", rx)
regexp    matches           example
a?        zero or one       quant("a?")
a*        zero or more      quant("a*")
a+        one or more       quant("a+")
a{n}      exactly n         quant("a{2}")
a{n, }    n or more         quant("a{2,}")
a{n, m}   between n and m   quant("a{2,4}")

GROUPS    ref <- function(rx) str_view_all("abbaab", rx)
Use parentheses to set precedence (order of evaluation) and create groups.
regexp     matches           example
(ab|d)e    sets precedence   alt("(ab|d)e")
Use an escaped number to refer to and duplicate parentheses groups that occur earlier in a pattern. Refer to each group by its order of appearance.
string (type this)   regexp (to mean this)   matches                example
\\1                  \1 (etc.)               first () group, etc.   ref("(a)(b)\\2\\1") (the result is the same as ref("abba"))

LOOK AROUNDS    look <- function(rx) str_view_all("bacad", rx)
regexp     matches           example
a(?=c)     followed by       look("a(?=c)")
a(?!c)     not followed by   look("a(?!c)")
(?<=b)a    preceded by       look("(?<=b)a")
(?<!b)a    not preceded by   look("(?<!b)a")
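A short sketch applying some of the patterns above with stringr (the example strings are invented for illustration):

library(stringr)

x <- c("abc ABC 123", "id-42", "abbaab")

str_detect(x, "[:digit:]")                        # TRUE TRUE FALSE - contains a digit?
str_extract(x, "\\d+")                            # "123" "42" NA   - first run of digits
str_replace_all(x, "[:space:]", "_")              # replace whitespace with underscores
str_replace_all("abbaab", "(a)(b)", "\\2\\1")     # swap each "ab" using back-references
str_detect("I", regex("i", ignore_case = TRUE))   # TRUE - case-insensitive match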
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor on Twitter • stringr 1.4.0+ • Updated: 2021-08
and geoms—visual marks that represent data points. b <- ggplot(seals, aes(x = long, y = lat)) e <- ggplot(mpg, aes(cty, hwy)) h <- ggplot(diamonds, aes(carat, price))
+ = b + geom_curve(aes(yend = lat + 1,
family, fontface, hjust, lineheight, size, vjust
h + geom_density_2d()
xend = long + 1), curvature = 1) - x, xend, y, yend, e + geom_point()
x, y, alpha, color, group, linetype, size
data geom
coordinate plot alpha, angle, color, curvature, linetype, size
and y locations.
a + geom_polygon(aes(alpha = 50)) - x, y, alpha, e + geom_rug(sides = “bl")
continuous function
i + geom_area()
x, y, alpha, color, fill, linetype, size
data geom
coordinate plot ymax, ymin, alpha, color, fill, linetype, size
x = F · y = A
system e + geom_text(aes(label = cty), nudge_x = 1, i + geom_line()
color = F
a + geom_ribbon(aes(ymin = unemploy - 900, nudge_y = 1) - x, y, label, alpha, angle, color,
size = A ymax = unemploy + 900)) - x, ymax, ymin,
x, y, alpha, color, group, linetype, size
ggplot(data = mpg, aes(x = cty, y = hwy)) Begins a plot ONE VARIABLE continuous
that you finish by adding layers to. Add one geom c <- ggplot(mpg, aes(hwy)); c2 <- ggplot(mpg) f + geom_dotplot(binaxis = "y", stackdir = “center") j + geom_linerange()
function per layer.
x, y, alpha, color, fill, group
x, ymin, ymax, alpha, color, group, linetype, size
c + geom_area(stat = "bin")
last_plot() Returns the last plot.
x, y, alpha, color, fill, linetype, size
f + geom_violin(scale = “area")
j + geom_pointrange() - x, y, ymin, ymax,
x, y, alpha, color, fill, group, linetype, size, weight alpha, color, fill, group, linetype, shape, size
ggsave("plot.png", width = 5, height = 5) Saves last plot c + geom_density(kernel = "gaussian")
as 5’ x 5’ file named "plot.png" in working directory. x, y, alpha, color, fill, group, linetype, size, weight
c + geom_dotplot()
g <- ggplot(diamonds, aes(cut, color)) data <- data.frame(murder = USArrests$Murder,
x, y, alpha, color, fill
state = tolower(rownames(USArrests)))
Aes Common aesthetic values. c + geom_freqpoly()
x, y, alpha, color, group, linetype, size
g + geom_count()
x, y, alpha, color, fill, shape, size, stroke
c2 + geom_qq(aes(sample = hwy))
lineend - string ("round", "butt", or "square")
x, y, alpha, color, fill, linetype, size, weight THREE VARIABLES
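A minimal sketch of the recipe described above, data plus aesthetic mappings plus geom layers, then ggsave(), using the mpg dataset that ships with ggplot2 (the geom choices are illustrative only):

library(ggplot2)

p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +   # data and aesthetic mappings
  geom_point(aes(color = class)) +                 # one geom function per layer
  geom_smooth(method = "lm", se = TRUE)            # add a linear trend with confidence band

ggsave("plot.png", plot = p, width = 5, height = 5)  # save the plot as a 5 x 5 inch file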
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
Stats An alternative way to build a layer. Scales Override defaults with scales package. Coordinate Systems Faceting
A stat builds new variables to plot (e.g., count, prop). Scales map data values to the visual values of an r <- d + geom_bar() Facets divide a plot into
fl cty cyl aesthetic. To change a mapping, add a new scale. r + coord_cartesian(xlim = c(0, 5)) - xlim, ylim subplots based on the
n <- d + geom_bar(aes(fill = fl)) The default cartesian coordinate system. values of one or more
+ =
x ..count..
discrete variables.
aesthetic prepackaged scale-specific r + coord_fixed(ratio = 1/2)
scale_ to adjust scale to use arguments ratio, xlim, ylim - Cartesian coordinates with t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
data stat geom coordinate plot
x=x· system n + scale_fill_manual( fixed aspect ratio between x and y units.
y = ..count.. values = c("skyblue", "royalblue", "blue", "navy"), t + facet_grid(cols = vars(fl))
Visualize a stat by changing the default stat of a geom limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", “r"), ggplot(mpg, aes(y = fl)) + geom_bar() Facet into columns based on fl.
name = "fuel", labels = c("D", "E", "P", "R")) Flip cartesian coordinates by switching
function, geom_bar(stat="count") or by using a stat
x and y aesthetic mappings. t + facet_grid(rows = vars(year))
function, stat_count(geom="bar"), which calls a default range of title to use in labels to use breaks to use in
values to include legend/axis in legend/axis legend/axis Facet into rows based on year.
geom to make a layer (equivalent to a geom function). in mapping
Use ..name.. syntax to map stat variables to aesthetics. r + coord_polar(theta = "x", direction=1)
theta, start, direction - Polar coordinates. t + facet_grid(rows = vars(year), cols = vars(fl))
GENERAL PURPOSE SCALES Facet into both rows and columns.
geom to use stat function geommappings r + coord_trans(y = “sqrt") - x, y, xlim, ylim t + facet_wrap(vars(fl))
Use with most aesthetics Transformed cartesian coordinates. Set xtrans
i + stat_density_2d(aes(fill = ..level..), Wrap facets into a rectangular layout.
scale_*_continuous() - Map cont’ values to visual ones. and ytrans to the name of a window function.
geom = "polygon")
variable created by stat scale_*_discrete() - Map discrete values to visual ones. Set scales to let axis limits vary across facets.
scale_*_binned() - Map continuous values to discrete bins. π + coord_quickmap()
60
π + coord_map(projection = "ortho", orientation t + facet_grid(rows = vars(drv), cols = vars(fl),
c + stat_bin(binwidth = 1, boundary = 10) scale_*_identity() - Use data values as visual ones. = c(41, -74, 0)) - projection, xlim, ylim scales = "free")
lat
x, y | ..count.., ..ncount.., ..density.., ..ndensity.. scale_*_manual(values = c()) - Map discrete values to Map projections from the mapproj package x and y axis limits adjust to individual facets:
manually chosen visual ones.
c + stat_count(width = 1) x, y | ..count.., ..prop.. long
(mercator (default), azequalarea, lagrange, etc.). "free_x" - x axis limits adjust
scale_*_date(date_labels = "%m/%d"), "free_y" - y axis limits adjust
c + stat_density(adjust = 1, kernel = "gaussian") date_breaks = "2 weeks") - Treat data values as dates.
x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = T)
scale_*_datetime() - Treat data values as date times.
Same as scale_*_date(). See ?strptime for label formats.
Position Adjustments Set labeller to adjust facet label:
t + facet_grid(cols = vars(fl), labeller = label_both)
x, y, fill | ..count.., ..density.. Position adjustments determine how to arrange geoms fl: c fl: d fl: e fl: p fl: r
X & Y LOCATION SCALES that would otherwise occupy the same space.
e + stat_bin_hex(bins = 30) x, y, fill | ..count.., ..density.. t + facet_grid(rows = vars(fl),
Use with x or y aesthetics (x shown here) s <- ggplot(mpg, aes(fl, fill = drv)) labeller = label_bquote(alpha ^ .(fl)))
e + stat_density_2d(contour = TRUE, n = 100)
x, y, color, size | ..level.. scale_x_log10() - Plot x on log10 scale. ↵c ↵d ↵e ↵p ↵r
scale_x_reverse() - Reverse the direction of the x axis. s + geom_bar(position = "dodge")
e + stat_ellipse(level = 0.95, segments = 51, type = "t") scale_x_sqrt() - Plot x on square root scale. Arrange elements side by side.
l + stat_contour(aes(z = z)) x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max) COLOR AND FILL SCALES (DISCRETE)
s + geom_bar(position = "fill")
Stack elements on top of one
Labels and Legends
x, y, z, fill | ..value.. another, normalize height. Use labs() to label the elements of your plot.
n + scale_fill_brewer(palette = "Blues")
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean) For palette choices: e + geom_point(position = "jitter") t + labs(x = "New x axis label", y = "New y axis label",
x, y, z, fill | ..value.. RColorBrewer::display.brewer.all() Add random noise to X and Y position of title ="Add a title above the plot",
each element to avoid overplotting. subtitle = "Add a subtitle below title",
f + stat_boxplot(coef = 1.5) n + scale_fill_grey(start = 0.2, A caption = "Add a caption below plot",
x, y | ..lower.., ..middle.., ..upper.., ..width.. , ..ymin.., ..ymax.. end = 0.8, na.value = "red") e + geom_label(position = "nudge") alt = "Add alt text to the plot",
B
Nudge labels away from points. <aes> = "New <aes>
<AES> <AES> legend title")
f + stat_ydensity(kernel = "gaussian", scale = "area") x, y
| ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width.. COLOR AND FILL SCALES (CONTINUOUS) s + geom_bar(position = "stack") t + annotate(geom = "text", x = 8, y = 9, label = “A")
Stack elements on top of one another. Places a geom with manually selected aesthetics.
e + stat_ecdf(n = 40) x, y | ..x.., ..y.. o <- c + geom_dotplot(aes(fill = ..x..))
e + stat_quantile(quantiles = c(0.1, 0.9), Each position adjustment can be recast as a function p + guides(x = guide_axis(n.dodge = 2)) Avoid crowded
o + scale_fill_distiller(palette = “Blues”) with manual width and height arguments: or overlapping labels with guide_axis(n.dodge or angle).
formula = y ~ log(x), method = "rq") x, y | ..quantile..
s + geom_bar(position = position_dodge(width = 1)) n + guides(fill = “none") Set legend type for each
e + stat_smooth(method = "lm", formula = y ~ x, se = T, o + scale_fill_gradient(low="red", high=“yellow") aesthetic: colorbar, legend, or none (no legend).
level = 0.95) x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + xlim(-5, 5) + stat_function(fun = dnorm,
o + scale_fill_gradient2(low = "red", high = “blue”,
mid = "white", midpoint = 25) Themes n + theme(legend.position = "bottom")
Place legend at "bottom", "top", "le ", or “right”.
n = 20, geom = “point”) x | ..x.., ..y.. n + scale_fill_discrete(name = "Title",
ggplot() + stat_qq(aes(sample = 1:100)) o + scale_fill_gradientn(colors = topo.colors(6)) r + theme_bw() r + theme_classic() labels = c("A", "B", "C", "D", "E"))
x, y, sample | ..sample.., ..theoretical.. Also: rainbow(), heat.colors(), terrain.colors(), White background Set legend title and labels with a scale function.
cm.colors(), RColorBrewer::brewer.pal() with grid lines. r + theme_light()
e + stat_sum() x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun = "mean", geom = "bar")
SHAPE AND SIZE SCALES
r + theme_gray()
Grey background
r + theme_linedraw()
r + theme_minimal()
Zooming
p <- e + geom_point(aes(shape = fl, size = cyl)) (default theme). Minimal theme. Without clipping (preferred):
e + stat_identity() p + scale_shape() + scale_size() r + theme_dark() r + theme_void() t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
e + stat_unique() p + scale_shape_manual(values = c(3:7)) Dark for contrast. Empty theme.
With clipping (removes unseen data points):
r + theme() Customize aspects of the theme such
as axis, legend, panel, and facet properties. t + xlim(0, 100) + ylim(10, 20)
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6) r + ggtitle(“Title”) + theme(plot.title.postion = “plot”) t + scale_x_continuous(limits = c(0, 100)) +
r + theme(panel.background = element_rect(fill = “blue”)) scale_y_continuous(limits = c(0, 100))
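A hedged sketch combining the scale, coordinate, faceting, label and theme ideas from this page into one plot (the particular scale and theme choices are illustrative):

library(ggplot2)

ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(color = drv)) +
  scale_color_brewer(palette = "Blues") +            # discrete color scale
  facet_grid(rows = vars(year), cols = vars(fl)) +   # facet into rows and columns
  coord_cartesian(xlim = c(5, 35)) +                 # zoom without clipping data
  labs(title = "City vs highway mileage",
       x = "City mpg", y = "Highway mpg") +
  theme_bw() +                                       # white background with grid lines
  theme(legend.position = "bottom")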
forwards
Open in new Save Find and
backwards/ window replace
Compile as Run
notebook selected
code
Import data History of past
with wizard commands to
run/copy
Manage
external
View
memory
databases usage
R tutorials
Control
and more in Source Pane Turn on at Tools > Project Options > Git/SVN
A• Added M• Modified
Check Render Choose Configure Insert D• Deleted R• Renamed
?• Untracked
Package Development
Click next to line number to Highlighted line shows where
RStudio opens plots in a dedicated Plots pane RStudio opens documentation in a dedicated Help pane add/remove a breakpoint. execution has paused
Create a new package with
File > New Project > New Directory > R Package
Navigate Open in Export Delete Delete
Enable roxygen documentation with recent plots window plot plot all plots Home page of Search within Search for
Tools > Project Options > Build Tools helpful links help file help file
Roxygen guide at Help > Roxygen Quick Reference
See package information in the Build Tab Viewer pane displays HTML content, such as Shiny
apps, RMarkdown reports, and interactive visualizations
GUI Package manager lists every installed package
Install package Run devtools::load_all()
and restart R and reload changes
Stop Shiny Publish to shinyapps.io, Refresh
Install Update Browse app rpubs, RSConnect, … Run commands in Examine variables Select function
Packages Packages package site environment where in executing in traceback to
Clear output execution has paused environment debug
Run R CMD and rebuild
check View(<data>) opens spreadsheet like view of data set
Customize Run Click to load package with Package Delete
package build package library(). Unclick to detach version from
options tests package with detach(). installed library
Filter rows by value Sort by Search Step through Step into and Resume Quit debug
or value range values for value code one line out of functions execution mode
at a time to run
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at rstudio.com • Font Awesome 5.15.3 • RStudio IDE 1.4.1717 • Updated: 2021-07
RUN CODE
Search command history
Windows/Linux
Ctrl+a
Mac
Cmd+a
DOCUMENTS AND APPS
Knit Document (knitr) Ctrl+Shift+K Cmd+Shift+K
Workbench
Interrupt current command Esc Esc Insert chunk (Sweave & Knitr) Ctrl+Alt+I Cmd+Option+I WHY RSTUDIO WORKBENCH?
Clear console Ctrl+L Ctrl+L Run from start to current line Ctrl+Alt+B Cmd+Option+B Extend the open source server with a
commercial license, support, and more:
NAVIGATE CODE MORE KEYBOARD SHORTCUTS
Go to File/Function Ctrl+. Ctrl+. Keyboard Shortcuts Help Alt+Shift+K Option+Shift+K • open and run multiple R sessions at once
Show Command Palette Ctrl+Shift+P Cmd+Shift+P • tune your resources to improve performance
WRITE CODE
Attempt completion Tab or Tab or
• administrative tools for managing user sessions
Ctrl+Space Ctrl+Space View the Keyboard Shortcut Quick Search for keyboard shortcuts with • collaborate real-time with others in shared projects
Insert <- (assignment operator) Alt+- Option+- Reference with Tools > Keyboard Tools > Show Command Palette • switch easily from one version of R to a different version
Shortcuts or Alt/Option + Shift + K or Ctrl/Cmd + Shift + P.
Insert %>% (pipe operator) Ctrl+Shift+M Cmd+Shift+M • integrate with your authentication, authorization, and audit practices
(Un)Comment selection Ctrl+Shift+C Cmd+Shift+C • work in the RStudio IDE, JupyterLab, Jupyter Notebooks, or VS Code
MAKE PACKAGES Windows/Linux Mac Download a free 45 day evaluation at
Visual Editor
workspace, and working Start new R Session Close R Session
Choose Choose Insert Jump to Jump Run directory associated with a in current project in project
Check Render output output code previous to next selected Publish Show file project. It reloads each when
spelling output format location chunk chunk chunk lines to server outline you re-open a project.
T H J
Back to
Source Editor
Block (front page) Active shared
format collaborators
Name of
current
Lists and Links Citations Images File outline project
Insert blocks, Select
block
citations, Insert and Share Project R Version
quotes More
formatting equations, and edit tables with Collaborators
Clear special
formatting characters
Insert
verbatim
code
Run Remote Jobs
Run R on remote clusters
(Kubernetes/Slurm) via the
Job Launcher
Add/Edit
attributes Monitor Launch a job
launcher jobs
Run launcher
jobs remotely
6.3 SQL language
15.003 Software Tools — Data Science Afshine Amidi & Shervine Amidi
-- Conditions........................optional
WHERE some_condition(s)
-- Aggregating.......................optional
GROUP BY column_group_list
-- Restricting aggregated values.....optional
HAVING some_condition(s)
-- Sorting values....................optional
ORDER BY column_order_list
-- Limiting number of rows...........optional
LIMIT some_value

Remark: the SELECT DISTINCT command can be used to ensure not having duplicate rows.

r Joins – Tables can be combined on related columns with the LEFT JOIN, RIGHT JOIN or FULL JOIN commands.
Remark: joining every row of table 1 with every row of table 2 can be done with the CROSS JOIN command, and is commonly known as the cartesian product.

r Condition – A condition is of the following format:
SQL
some_col some_operator some_col_or_value
where some_operator can be among the common comparison operations (e.g. =, <>, >, <, >=, <=, IN, LIKE, BETWEEN).

Aggregations

r Grouping data – Aggregate metrics are computed on grouped data with the GROUP BY command. The WHERE and HAVING filters differ as follows:

WHERE                                            HAVING
- Filter condition applies to individual rows    - Filter condition applies to aggregates
- Statement placed right after FROM              - Statement placed right after GROUP BY

Remark: if WHERE and HAVING are both in the same query, WHERE will be executed first.

r Grouping sets – The GROUPING SETS command is useful when there is a need to compute aggregations across different dimensions at a time. Below is an example of how all aggregations across two dimensions are computed. The SQL command is as follows:

SQL
SELECT
....col_1,
....col_2,
....agg_function(col_3)
FROM table
GROUP BY (
..GROUPING SETS
....(col_1),
....(col_2),
....(col_1, col_2)
)

r Aggregation functions – The main aggregate functions that can be used in an aggregation query include COUNT, SUM, AVG, MIN and MAX, among others.

r Window functions – A window function computes a metric over a group (window) of rows without collapsing them, and is written in the following way:

SQL
some_window_function() OVER(PARTITION BY some_col ORDER BY another_col)

Remark: window functions are only allowed in the SELECT clause.

r Row numbering – The table below summarizes the main commands that rank each row across specified groups, ordered by a specific column:

Command        Description                                       Example
ROW_NUMBER()   Ties are given different ranks                    1, 2, 3, 4
RANK()         Ties are given same rank and skip numbers         1, 2, 2, 4
DENSE_RANK()   Ties are given same rank and don't skip numbers   1, 2, 2, 3
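To make the ranking behaviour concrete, here is a hedged sketch that runs these window functions from R against an in-memory SQLite database (the DBI and RSQLite packages are assumed to be installed, with a SQLite build of at least 3.25 for window-function support):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "scores", data.frame(player = c("a", "b", "c", "d"),
                                       score  = c(10, 20, 20, 30)))

dbGetQuery(con, "
  SELECT player, score,
         ROW_NUMBER() OVER (ORDER BY score) AS row_num,    -- ties get different ranks
         RANK()       OVER (ORDER BY score) AS rnk,        -- ties share a rank, numbers skipped
         DENSE_RANK() OVER (ORDER BY score) AS dense_rnk   -- ties share a rank, no skipping
  FROM scores")

dbDisconnect(con)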
r SQL tips – In order to keep the query in a clear and concise format, the following tricks are often done:

Operation              Command                                   Description
Renaming columns       SELECT operation_on_column AS col_name    New column names shown in query results
Abbreviating tables    FROM table_1 t1                           Abbreviation used within query for simplicity in notations
Simplifying group by   GROUP BY col_number_list                  Specify column position in SELECT clause instead of whole column names
Limiting results       LIMIT n                                   Display only n rows

r Common functions – The main general, value, string and date functions are summarized below:

Type      Description                                                 Command
General   Take first non-NULL value                                   COALESCE(col_1, col_2, ..., col_n)
General   Create a new column combining existing ones                 CONCAT(col_1, ..., col_n)
Value     Round value to n decimals                                   ROUND(col, n)
String    Convert string column to lower / upper case                 LOWER(col) / UPPER(col)
String    Replace occurrences of old in col with new                  REPLACE(col, old, new)
String    Take the substring of col, with a given start and length    SUBSTR(col, start, length)
String    Remove spaces from the left / right / both sides            LTRIM(col) / RTRIM(col) / TRIM(col)
String    Length of the string                                        LENGTH(col)
Date      Truncate at a given granularity (year, month, week)         DATE_TRUNC(time_dimension, col_date)

r Sorting values – The query results can be sorted along a given set of columns using the ORDER BY command.

r Common table expression – Intermediate results can be named and reused within a query using the WITH command:

SQL
WITH cte_1 AS (
..SELECT ...
),
...
cte_n AS (
..SELECT ...
)
SELECT ...
FROM ...;

Table manipulation

r Table creation – The creation of a table is done as follows:

SQL
CREATE [table_type] TABLE [creation_type] table_name(
..col_1 data_type_1,
...................,
..col_n data_type_n
)
[options];

r Data insertion – New data can either append or overwrite already existing data in a given table as follows:

SQL
WITH ..............................-- optional
INSERT [insert_type] table_name....-- mandatory
SELECT ...;........................-- mandatory

where insert_type can be one of the following:

Command     Description
OVERWRITE   Overwrites existing data
INTO        Appends to existing data

r Dropping table – Tables are dropped in the following way:

SQL
DROP TABLE table_name;

r View – Instead of using a complicated query, the latter can be saved as a view which can then be used to get the data. A view is created with the following command:

SQL
CREATE VIEW view_name AS complicated_query;

Remark: a view does not create any physical table and is instead seen as a shortcut.
@AbzAaron
CHEATSHEET
a b
a LEFT JOIN b
Select all rows from col1 & col2 ordering by col1
a b
Return count of rows in table
a RIGHT JOIN b
a b
FROM
GROUP BY
HAVING
Data Manipulation Language
UPDATE INSERT
SELECT Implementation of CASE statement
ORDER BY
A B A B
WHERE col4 = 1 AND col5 = 2
-- aggregate the data
GROUP by …
-- limit aggregated data
HAVING count(*) > 1
-- order of the results
ORDER BY col2 LEFT OUTER JOIN - all rows from table A, INNER JOIN - fetch the results that RIGHT OUTER JOIN - all rows from table B,
even if they do not exist in table B exist in both tables even if they do not exist in table A
Useful keywords for SELECTS:
DISTINCT - return unique results
BETWEEN a AND b - limit the range, the values can be Updates on JOINed Queries Useful Utility Functions
numbers, text, or dates You can use JOINs in your UPDATEs -- convert strings to dates:
LIKE - pattern search within the column text UPDATE t1 SET a = 1 TO_DATE (Oracle, PostgreSQL), STR_TO_DATE (MySQL)
IN (a, b, c) - check if the value is contained among given. FROM table1 t1 JOIN table2 t2 ON t1.id = t2.t1_id -- return the first non-NULL argument:
WHERE t1.col1 = 0 AND t2.col2 IS NULL; COALESCE (col1, col2, “default value”)
-- return current time:
Data Modification NB! Use database specific syntax, it might be faster! CURRENT_TIMESTAMP
-- update specific data with the WHERE clause -- compute set operations on two result sets
UPDATE table1 SET col1 = 1 WHERE col2 = 2 Semi JOINs SELECT col1, col2 FROM table1
-- insert values manually UNION / EXCEPT / INTERSECT
You can use subqueries instead of JOINs: SELECT col3, col4 FROM table2;
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME)
VALUES (1, ‘Rebel’, ‘Labs’); SELECT col1, col2 FROM table1 WHERE id IN
-- or by using the results of a query (SELECT t1_id FROM table2 WHERE date > Union - returns data from both queries
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME) CURRENT_TIMESTAMP) Except - rows from the first query that are not present
SELECT id, last_name, first_name FROM table2 in the second query
Intersect - rows that are returned from both queries
Indexes
Views If you query by a column, index it!
A VIEW is a virtual table, which is a result of a query. CREATE INDEX index1 ON table1 (col1) Reporting
They can be used to create virtual tables of complex queries. Use aggregation functions
Don’t forget:
CREATE VIEW view1 AS COUNT - return the number of rows
SELECT col1, col2 Avoid overlapping indexes SUM - cumulate the values
FROM table1 Avoid indexing on too many columns AVG - return the average for the group
WHERE … Indexes can speed up DELETE and UPDATE operations MIN / MAX - smallest / largest value
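A similar hedged sketch, again via RSQLite from R, of the view, index and reporting commands above (table and column names are made up):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "table1", data.frame(col1 = c("x", "x", "y"), col2 = c(1, 2, 5)))

dbExecute(con, "CREATE INDEX index1 ON table1 (col1)")   # index the column you filter on
dbExecute(con, "CREATE VIEW view1 AS
                SELECT col1, SUM(col2) AS total FROM table1 GROUP BY col1")

dbGetQuery(con, "SELECT col1, COUNT(*) AS n, AVG(col2) AS avg_col2
                 FROM table1 GROUP BY col1 HAVING COUNT(*) > 1")   # simple aggregate report
dbGetQuery(con, "SELECT * FROM view1")                              # query the view like a table

dbDisconnect(con)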
7. Business Intelligence
7.1 Tableau
3. Measures: A measure is a type of field that contains quantitative values (e.g. revenue, costs, and Aggregating data
market sizes). When dragged into a view, this data is aggregated, which is determined by the When data is dragged into the Rows and Columns on a sheet, it is aggregated based on the dimensions in the sheet.
dimensions in the view
This is typically a summed value. The default aggregation can be changed using the steps below:
Right-click on a measure field in the Data pane
Go down to Default Properties, Aggregation, and select the aggregation you would like to use

4. Data types: Every field has a data type which is determined by the type of information it contains. The available data types in Tableau include text, date values, date & time values, numerical values, boolean values, geographical values, and cluster groups

Tableau for Business Intelligence
Changing colors
Color is a critical component of visualizations. It draws attention to details. Attention is the most important
The Canvas
Tableau Basics Cheat Sheet The canvas is where you’ll create data visualizations
component of strong storytelling. Colors in a graph can be set using the marks card.
Create a visualization by dragging fields into the Rows and Columns section at the top of the scree
Drag dimensions into the Marks field, specifically into the Color squar
1. Tableau Canvas: The canvas takes up most of the screen on Tableau and is where you can add visualizations
To change from the default colors, go to the upper-right corner of the color legend and select Edit Colors. This
earn Tableau online at www.DataCamp.com
L 2. Rows and columns: Rows and columns dictate how the data is displayed in the canvas. When dimensions will bring up a dialog that allows you to select a different palette
are placed, they create headers for the rows or columns while measures add quantitative values
3. Marks card: The marks card allows users to add visual details such as color, size, labels, etc. to rows and columns. Changing fonts
This is done by dragging fields from the data pane into the marks card
Fonts can help with the aesthetic of the visualization or help with consistent branding. To change the workbook s font,
’
Tableau is a business intelligence tool that allows you to effectively report insights through easy-to-use, customizable visualizations and dashboards.

Upload a dataset to Tableau
Launch Tableau
In the Connect section, under To a File, press on the file format of your choice
For selecting an Excel file, select .xls or .xlsx
> Creating dashboards with Tableau
Creating your first visualization Dashboards are an excellent way to consolidate visualizations and present data to a variety of stakeholders. Here is a
Once your file is uploaded, open a Worksheet and click on the Data pane on the left-hand sid step by step process you can follow to create a dashboard.
> Why use Tableau? Drag and drop at least one field into the Columns section, and one field into the Rows section at the top
of the canva
To add more detail, drag and drop a dimension into the Marks card (e.g. drag a dimension over the color square
Launch Tablea
In the Connect section under To A File, press on your desired file typ
Select your fil
in the marks card to color visualization components by that dimension Click the New Sheet at the bottom to create a new shee
Easy to use—no coding Integrates seamlessly with Fast and can handle large To a summary insight like a trendline, click on the Analytics pane and drag the trend line into your visualization
involved any data source datasets Create a visualization in the sheet by following the steps in the previous sections of this cheat shee
You can change the type of visualization for your data by clicking on the Show Me button on the top right Repeat steps 4 and 5 untill you have created all the visualizations you want to include in your dashboar
Click the New Dashboard at the bottom of the scree
On the left-hand side, you will see all your created sheets. Drag sheets into the dashboar
> Tableau Versions > Data Visualizations in Tableau
Adjust the layout of your sheets by dragging and dropping your visualizations
Stacked Bar Chart: Used to show categorical data within a bar chart (e.g., sales by region and department)
Side-by-Side Bar Chart: Used to compare values across categories in a bar chart format (e.g., sales by
> Getting started with Tableau region comparing product types)
L ine Charts: Used for looking at a numeric value over time (e.g., revenue over time)
When working with Tableau, you will work with Workbooks. Workbooks contain sheets, dashboards, and stories.
Similar to Microsoft Excel, a Workbook can contain multiple sheets. A sheet can be any of the following and can be Scatter Plot: Used to identify patterns between two continuous variables (e.g., profit vs. sales volume)
Worksheet Dashboard story Box-and-Whisker Plot: Used to compare distributions between categorical variables (e.g., distribution of
A worksheet is a single
view in a workbook. You
A collection of multiple
worksheets used to
A story is a collection of
multiple dashboards and/
revenue by region)
Heat Map: Used to visualize data in rows and columns as colors (e.g., revenue by marketing channel)
Map: Used to show geographical data with color formatting (e.g., Covid cases by state)
Treemap: Used to show hierarchical data (e.g., show how much revenue subdivisions generate relative to the whole department within an organization)
Dual Combination: Used to show two visualizations within the same visualization (e.g., profit for a store each month)

A story is made of story points, which let you cycle through different visualizations and dashboards
To begin adding to the story, add a story point from the left-hand side. You can add a blank story point
To add a summary text to the story, click Add a caption and summarize the story point
Add as many story points as you would like to finalize your data story

> The Anatomy of a Worksheet
When opening a worksheet, you will work with a variety of tools and interfaces
The Sidebar
In the sidebar, you’ll find useful panes for working with dat
Data: The data pane on the left-hand side contains all of the fields in the currently selected data sourc
> Customizing Visualizations with Tableau
Analytics: The analytics pane on the left-hand side lets you add useful insights like trend lines, error bars,
Tableau provides a deep ability to filter, format, aggregate, customize, and highlight specific parts of your data
and other useful summaries to visualizations
visualizations
Drag-and-drop a field you want to filter on and add it to the Filters car
Fill out in the modal how you would like your visuals to be filtered on the data
Learn Data Skills Online at www.DataCamp.com
Ta bl ea u -Des k top Data Sources
Types Work Types Work
C H E AT S H E E T
Filter Dimensions Applied on the dimension fields. Multiple Values (List) Select one or more values in a list.
Filter Measures Applied on the measure fields. Multiple Values (Dropdown) Select one or more values in a drop-down list.
Data Sources Filter Dates Applied on the date fields. Multiple Values (Custom List) Search and select one or more values.
File Systems CSV, Excel, etc. Single Value (List) Select one value at a time in a list. Single Value (Slider) Drag a horizontal slider to select a single value.
Relational Systems Oracle, Sql Server, DB2, etc.
Single Value (Dropdown) Select a single value in a drop-down list. Wildcard Match Select values containing the specified characters.
Cloud Systems Windows Azure, Google BigQuery, etc.
Other Sources ODBC Tableau Charts
Data Extract Type Description
• Extraction of data is done by following Text Table (Crosstab) To see your data in rows and columns.
Me u → Data → E tract Data.
•
Heat Map Just like Crosstab, but it uses size and color as visual cues to describe the data.
Applying Extract Filters to create subset of data
•
Highlight Table Just like Excel table, but the cells here are colored.
To add more data for an already created extract
Data → E tract → Appe d Data fro File Symbol Map Visualize and highlight geographical data.
• •Relational
Extract History Operators Filled Map Color filled geographical data visualization.
Menu - Data → E tract Histor Pie Chart Represents data as slices of a circle with different sizes and colors.
•Data
Logical
Joining Operators
Data Blending
Horizontal Bar Chart
Stacked Bar Chart
Represents data in horizontal bars, visually digestible.
Visualize data of a category having sub-categories.
• Creating a Join • Preparing Data for Blending
Side-by-Side Bar Chart Side by side comparison of data, vertical representation.
• Editing a Join Type • Adding Secondary Data Source
Treemap Similar to a heat map, but the boxes are grouped by items that are close in hierarchy.
• Editing Join Fields • Blending the Data
Circle View Shows the different values that are within the categories.
Operators Side-by-Side Circle View fields) Combination of Circle view and Side-by-Side Bar Chart
• General Operators • Relational Operators Line Chart (Continuous) Several number of lines in the view to show continuous flow of data, must have a date.
• Arithmetic Operators • Logical Operators Line Chart (Discrete) This allows slicing and dicing of the graph, graph not continuous.
Dual Line Chart Comparing two measures over a period.
LOD Expressions
Scatter Plot Scatter plot shows many points scattered in the Cartesian plane
• Fixed LOD , Include LOD and Exclude LOD Histogram A histogram represents the frequencies of values of a variable bucketed into ranges
Gantt Chart It illustrates a project schedule.
Sorting
Bullet Graph Two bars drawn upon one another to indicate their individual values at the same position in the graph
• Computed Sorting: Directly applied on an axis using the sort
Waterfall Chart It shows where a value starts, ends and how it gets there incrementally
dialog button.
• Manual Sorting: Rearrange the order of dimension fields by FURTHERMORE:
dragging them next to each other. Tableau Training and Certification - Tableau 10 Desktop Course
File Short Cuts: Tableau Terminologies
Shortcuts &
ALT+F+E+I Export to image
Bookmark: .tbm file in the Bookmarks folder contains a single worksheet of the Tableau repository
ALT+F+E+P Export to packaged workbook
Calculated Field: New field created by using a formula to modify the existing fields in data source
CTRL+N New workbook
Crosstab: Text table view to display the numbers associated with dimension members
Te r m i n o l o g i e s
CTRL+O Open file
Dashboard: Use dashboards to compare and monitor a variety of data simultaneously
CTRL+P Print
Data Pane: Displays the fields (divided into dimensions and measures) of the data sources to which Tableau is
CTRL+S Save file
C H E AT S H E E T
connected
File Short Cuts: Data Source Page: A page to set up your data source consists of − left pane, join area, preview area, and metadata
A powerful Data visualization and Business intelligence tool with a ALT+A+C Create calculated field Extract: A saved subset of a data source that can be used to improve performance and analyze offline.
strong and intuitive interface. No coding knowledge or experience ALT+A+B Describe trend model Filters Shelf: Used to exclude data from a view by filtering it using measures and dimensions
Connect to data sourceBuild data viewsEnhance
needed to work with Tableau. ALT+A+U data viewsWorksheetsCreate
Edit calculated field Format Pane: Contains formatting and organize
settings that control the entiredashboardsStory
worksheet, as well as individual fields in the view
Data Short
Telling Cuts: ALT+A+L Edit trend lines
Level of Detail (LOD) Expression: A syntax that supports aggregation at dimensionalities other than the view level.
Data Operation ALT+A+F Filter
Marks: A part of the view that visually represents one or more rows in a data source. A mark can be, for example, a
ALT+D+A Automatic updates CTRL+1 Show me!
bar, line, or square. You can control the type, color, and size of marks
ALT+D+D Connect to data ALT+A+S Sort
Marks Card: A card to the left of the view, where you can drag fields to control mark properties such as type, color,
CTRL+D Connect to data source ALT+A+M+F Stack marks off
size, shape, label, tooltip, and detail
ALT+D+C+D Duplicate data connection ALT+A+M+O Stack marks on
Pages Shelf: Used to split a view into a sequence of pages based on the members and values in a discrete or
ALT+D+A Extract Get started with Tableau continuous field
ALT+D+C+P Properties of data connection
• Tableau is BI software Rows Shelf: Used to create the rows of a data table, also accepts any number of dimensions and measures
ALT+D+R Refresh data
• Allows users to connect to data, visualize data and create Worksheet: A sheet to build views of the data
F5 Refreshes the data source
interactive and sharable dashboards. Workbook: Contains one or more worksheets
F9 Run query
ALT+D+U Toggle use extract Design Flow
F10 Toggles automatic updates on and off
Connect to Data Source Build Data Views Enhance Data Views Worksheets Create and Organize Dashboards Story Telling
ALT+D+P+D Update dashboard
ALT+D+P+Q Update quick filters FURTHERMORE:
ALT+D+P+W Update worksheet Tableau Training and Certification - Tableau 10 Desktop Course
7.2 Power BI
Power BI Cheat Sheet

Create your first visualization
Click on the Report View and go to the Visualizations pane on the right-hand side
Select the type of visualization you would like to plot your data on. Keep reading this cheat sheet to learn the different visualizations you can use

Appending datasets
You can append one dataset to another
Click on Append Queries under the Home tab under the Combine group
Select to append either Two tables or Three or more tables
Add tables to append under the provided section in the same window

Merge Queries
Click on Merge Queries under the Home tab under the Combine group
Select the first table and the second table you would like to merge
Select the columns you would like to join the tables on by clicking on the column from the first dataset, and from the second dataset

Aggregating data
Data profiling
> Data Visualizations in Power BI Data Profiling is a feature in Power Query that provides intuitive information about your dat
> Why use Power BI? L ine Charts: Used for looking at a numeric value over time (e.g. revenue over time)
Easy to use—no coding Integrates seamlessly with Fast and can handle large Scatter: Displays one set of numerical data along the horizontal axis and another set along the vertical axis (e.g.
involved any data source datasets relation age and loan)
Combo Chart: Combines a column chart and a line chart (e.g. actual sales performance vs target) Data Analysis Expressions (DAX) is a calculation language used in Power BI that lets you create calculations and
Treemaps: Used to visualize categories with colored rectangles, sized with respect to their value (e.g. product perform data analysis. It is used to create calculated columns, measures, and custom tables. DAX functions are
Throughout this section, we’ll use the columns listed in this sample table of `sales_data`
There are three components to Power BI—each of them serving different purposes Maps: Used to map categorical and quantitative information to spatial locations (e.g. sales per state)
Cards: Used for displaying a single fact or single data point (e.g. total sales) deal_size sales_person date customer _name
P ow e r B I D e s k to p P ow e r B I s e r v i c e P ow e r B I m o b i l e
Free desktop application that Cloud-based version of Power BI A mobile app of Power BI, which Table: Grid used to display data in a logical series of rows and columns (e.g. all products with sold items) 1,000 Maria Shuttleworth 30-03-2022 Acme Inc.
provides data analysis and with report editing and publishing allows you to author, view, and 3,000 Nuno Rocha 29-03-2022 Spotflix
creation tools. features. share reports on the go.
2,300 Terence Mickey 13-04-2022 DataChamp
AVERAGE
ME AN DI
l (<co
l
l
adds all the numbers in a colum
(<co
umn>)
There are three main views in Power BI you connect to one or many data sources, shape and transform data to meet your needs, and load it into Power BI. M N MAX
I / l returns the smallest biggest value in a colum
(<co umn>) /
COUNT l (<cocounts the number of cells in a column that contain non blank value
umn>) -
report view da t a v i e w model view T NCTCOUNT
DIS I l counts the number of distinct values in a column.
(<co umn>)
This view is the default This view lets you examine This view helps you Open the Power Query Editor EX AM P LE
view, where you can datasets associated with establish different
Sum of all deals — SUM(‘sales_data’[deal_size]
visualize data and create your reports relationships between
While loading dat Average deal size — AVERAGE(‘sales_data’[deal_size]
reports datasets
Underneath the Home tab, click on Get Dat Distinct number of customers — DISTINCTCOUNT(‘sales_data’[customer_name])
Choose any of your datasets and double clic
Click on Transform Data
Logical unction
f
Under Queries in the Home tab of the ribbon, click on Transform Data drop-down, then on the Transform Data Create a column called large_deal that returns “Yes” if deal_size is bigger than 2,000 and “No” otherwise
button la e_deal sales_data deal_s e , es , N
Upload datasets into Power BI rg = IF( ‘ ’[ iz ] > 2000 “Y ” “ o”)
e t unction
Using the Power Query Editor
T x F
Choose any of your datasets and double clic LO W ER te t converts a text string to all lowercase letter
(< x >)
If you need to transform the data, click Transform which will launch Power Query. Keep reading this cheat sheet for You can remove rows dependent on their location, and propertie RE PL ACE ld_te t , sta t_
(<o x > , < _ a s , e _te t replaces part of a text string with a
r num> <num ch r > <n w x >)
how to apply transformations in Power Query Click on the Home tab in the Query ribbo different text string.
Inspect your data by clicking on the Data View Click on Remove Rows in the Reduce Rows group EX AM P LE
Choose which option to remove, whether Remove Top Rows, Remove Bottom Rows, etc. Change column customer_name be only lower case
Choose the number of rows to remov st e _ a e O ER sales_data st e _ a e
Create relationships in Power BI
cu om r n m = L W (‘ ’[cu om r n m ])
You can undo your action by removing it from the Applied Steps list on the right-hand side
SalesPersonID
Click on the Model View from the left-hand pan Click on the Add Column tab in the Query ribbo WEE A date ,
KD Y(< et _t e returns 1 corresponding to the day of the week of a date ( et
> <r urn yp >) -7 r urn _t e
yp
Connect key columns from different datasets by dragging one to another Click on Custom Column in the General grou indicates week start and end (1: Sunday Saturday, 2: Monday Sunday)
- -
(e.g., EmployeeID to e.g., SalespersonID) Name your new column by using the New Column Name optio
Employee Database EX AM P LE
Define the new column formula under the custom column formula using the available data
EmployeeID
Return the day of week of each deal
Replace values
week_da EE A sales_data
y = W KD Y(‘ ’[ date ] , 2)
You can replace one value with another value wherever that value is found in a colum
In the Power Query Editor, select the cell or column you want to replac
Click on the column or value, and click on Replace Values under the Home tab under the Transform grou
Fill the Value to Find and Replace With fields to complete your operation Learn Data Skills Online at www.DataCamp.com
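For readers who prefer to prototype these calculations outside Power BI, the same aggregations can be sketched in Python with pandas. This is an illustrative sketch only (not part of the cheat sheet above); the DataFrame below simply mirrors the sample `sales_data` table.

import pandas as pd

# Mirror of the sample sales_data table used in the DAX examples above.
sales_data = pd.DataFrame({
    "deal_size": [1000, 3000, 2300],
    "sales_person": ["Maria Shuttleworth", "Nuno Rocha", "Terence Mickey"],
    "date": pd.to_datetime(["2022-03-30", "2022-03-29", "2022-04-13"]),
    "customer_name": ["Acme Inc.", "Spotflix", "DataChamp"],
})

# Equivalents of SUM, AVERAGE and DISTINCTCOUNT over 'sales_data'
total_deals = sales_data["deal_size"].sum()
average_deal = sales_data["deal_size"].mean()
distinct_customers = sales_data["customer_name"].nunique()

# Equivalent of IF('sales_data'[deal_size] > 2000, "Yes", "No")
sales_data["large_deal"] = sales_data["deal_size"].gt(2000).map({True: "Yes", False: "No"})

# Equivalents of LOWER(...) and WEEKDAY(..., 2) (Monday = 1)
sales_data["customer_name"] = sales_data["customer_name"].str.lower()
sales_data["week_day"] = sales_data["date"].dt.dayofweek + 1

print(total_deals, average_deal, distinct_customers)
print(sales_data)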
OVERVIEW

What is Power BI?
"It is Microsoft's Self-Service Business Intelligence tool for processing and analyzing data." Apart from the standard version, there is also a version for Report Server.

Components
› Power BI Desktop: desktop application.
  › Report: multi-page canvas visible to end users. It serves for the placement of visuals, buttons, images, slicers, etc.
  › Data: preview pane for data loaded into a model.
  › Model: editable scheme of relationships between tables in a model. Pages can be used in a model for easier navigation.
  › Power Query: a tool for connecting, transforming, and combining data.
› Power BI Service: a cloud service enabling access to, and sharing and administration of, output data.
  › Workspace: there are three types of workspaces (Personal, Team, and Develop a template app). They serve as storage and enable controlled access to output data.
  › Dashboard: a space consisting of tiles in which visuals and report pages are stored.*
  › Report: a report of pages containing visuals.*
  › Worksheet: a published Excel worksheet. Can be used as a tile on a dashboard.
  › Dataset: a published sequence for fetching and transforming data from Power BI Desktop.
  › Dataflow: an online Power Query representing a special dataset outside of Power BI Desktop.*
  › Application: a single location combining one or more reports or dashboards.*
  › Admin portal: administration portal that lets you configure capacities, permissions, and capabilities for individual users and workspaces.
  › Data Gateway: on-premises data gateway that lets you transport data from an internal network or a custom device to the Power BI Service.
  *Can be created and edited in the Power BI Service environment.
› Power BI Mobile: mobile app for viewing reports. The mobile view is applied if it exists; otherwise the desktop view is used.
› Report Server: on-premises version of the Power BI Service.
› Report Builder: a tool for creating paginated reports.

Power Query
Works with data fetched from data sources using connectors. This data is then processed at the Power BI app level and stored in an in-memory database in the program background, which means that data is not processed at the source level. The basic unit in Power Query is the query: a sequence consisting of steps. A step is a data command that dictates what should happen to the data when it is loaded into Power BI. The basic definition of each step is based on its use:
› Connecting data: each query begins with a function that provides data for the subsequent steps, e.g. data can be loaded from Excel, an SQL database, SharePoint, etc. Connection steps can also be used later.
› Transforming data: steps that modify the structure of the data, with features such as Pivot Column, converting columns to rows, grouping data, splitting columns, removing columns, etc. Transformation steps are necessary in order to clean data from not entirely clean data sources.
› Combining data: data split into multiple source files needs to be combined so that it can be analyzed in bulk. Functions include merging and appending queries.
  › Merge queries: merges queries based on the selected key. The primary query then contains a column which can be used to extract data from a secondary query. Supports the typical join types.
  › Append query: places the resulting data from one or more selected queries under the primary query. Data is placed in columns whose names are an exact match; non-matching columns form new columns with a unique name in the primary query.
› Custom function: a query intended to apply a pre-defined sequence of steps so that the author does not need to create them repeatedly. A custom function can also accept input data (values, sheets, etc.) to be used in the sequence.
› Parameter: values independent of datasets that can then be used in queries. Parameters enable quick editing of a model because they can be changed from the Power BI Service environment.

DAX
A language developed for data analysis. It enables the creation of the following objects using expressions: Measures, Calculated Columns, and Calculated Tables. Each expression starts with the = sign, followed by links to tables/columns/functions/measures and operators. The following operators are supported:
› Arithmetic { + , - , / , * , ^ }
› Comparison { = , == , > , < , >= , <= , <> }
› Text concatenation and logic { & , && , || , IN }
› Precedence { ( , ) }
Operators and functions require that all values/columns used are of the same data type, or of a type that can be freely converted, such as a date or a number.

Drill Down
A visual that supports the embedding of hierarchies enables drilling down to the embedded hierarchy's individual levels using the following actions: drill up to a higher-level hierarchy; drill down to a specific field; drill down to the next level in the hierarchy; expand the next-level hierarchy.

Tooltip / Custom Tooltip
› Tooltip: a default detail preview pane which appears above a visual when you hover over its values.
› Custom Tooltip: a custom-designed report page identified as a tooltip. When you hover over the visual, that page appears with content filtered based on the value in the visual.

License
Per-user licenses:
› Free: can be obtained for any Microsoft work or school email account. Intended for personal use. Users with this license can only use the personal workspace; they cannot share or consume shared content (if it is not available in a Premium workspace).
› Pro: associated with a work/school account, priced at €8.40 per month or included in the E5 license. Intended for team collaboration. Lets users access team workspaces, consume shared content, and use apps.
› Premium Per User: includes all Power BI Pro capabilities and adds features such as paginated reports, AI, a greater refresh-rate frequency, the XMLA endpoint, and other capabilities that are only available to Premium subscribers.
Per-tenant licenses:
› Premium: set up for individual workspaces; 0 to N workspaces can be used with a single version of this license. It provides dedicated server computing power based on the license type: P1, P2, P3, P4*, P5*. It offers more space for datasets, extended metrics for individual workspaces, managed consumption of dedicated capacity, linking of Azure AI features with datasets, and access to shared content for users with Free licenses. Prices start at €4,212.30. (*Only available upon special request; intended for models larger than 100 GB.)
› Embedded: supports embedding dashboards and reports in custom apps.
› Report Server: included in Premium or SQL Server Enterprise licenses.

Visualization
Visualizations (visuals) let you present data in various graphical forms, from graphs to tables, maps, and values. Some visuals are linked to services outside Power BI, such as Power Apps. In addition to the basic visuals, Power BI supports creating custom visuals, which can be added using a file import or from a free Marketplace offering certified and non-certified visuals. Certification is optional, but it verifies, among other things, whether a visual accesses external services and resources.

Drill-through
Drill-through lets you pass from a data-overview visual to a page with specific details. The target page is displayed with all the applied filters affecting the value from which the drill-through originated.

Dataflow
The basic unit is a table or Entity consisting of columns or Fields. Just like queries in Power Query, entities in dataflows consist of sequences of steps. The result of such steps is stored in native Azure Data Lake Gen 2 ("You can connect a custom Data Lake where the data will be stored"). There are three types of entities:
› Standard entity: works only with data fetched directly from a data source or with data from non-stored entities within the same dataflow.
› Computed entity*: uses data from another stored entity within the same dataflow.
› Linked entity*: uses data from an entity located in another dataflow. If data in the original entity is updated, the new data is directly passed to all linked entities.
*Can only be used in a dedicated Power BI Premium workspace.

Themes
A single location for configuring all native graphical settings for visuals and pages. By default, you can choose from 19 predefined themes; custom themes can be added. A custom theme can be applied in two different ways:
› Modification of an existing theme: a native window that lets you modify a theme directly in the Power BI environment. The advantage of this approach is that you can customize any single visual.
› Importing a JSON file: the file you create only defines the formatting that should change; everything else remains the same. "The resulting theme can be exported in the JSON format and used in any report without the need to create a theme from scratch."

Bookmarks
Bookmarks capture the currently configured view of a report page or visual. Later, you can go back to that state by selecting the saved bookmark. Setting options:
› Data: stores filters and the applied sort order in visuals and slicers. By selecting the bookmark, you re-apply the corresponding settings.
› Display: stores the state of the display for visuals and report elements (buttons, images, etc.). By selecting the bookmark, you go back to the previously stored state of the display.
› Current page: stores the currently displayed page. By selecting the bookmark, you go back to the stored page.

External Tools
They simplify the use of Power BI and extend the capabilities it offers. These tools are mostly developed by the community. Recommended external tools: Tabular Editor, DAX Studio, ALM Toolkit, VertiPaq Analyzer.

Administration
› Usage metrics: let you monitor Power BI usage for your organization.
› Users: the Users tab provides a link to the Microsoft 365 admin center.
› Audit logs: the Audit logs tab provides a link to the Security & Compliance center.
› Tenant settings: enable fine-grained control over the features made available to your organization; they control which features are enabled or disabled, and for which users and groups.
› Capacity settings: the Power BI Premium tab enables you to manage any Power BI Premium and Embedded capacities.
› Embed codes: view the embed codes generated for your tenant to share reports publicly; you can also revoke or delete codes.
› Organization visuals: control which types of Power BI visuals users can access across the organization.
› Azure connections: control workspace-level storage permissions for Azure Data Lake Gen 2.
› Workspaces: view the workspaces that exist in your tenant on the Workspaces tab.
› Custom branding: customize the look of Power BI for your whole organization.
› Protection metrics: the report shows how sensitivity labels help protect your content.
› Featured content: manage all the content promoted in the Featured section.

Built-in and additional languages
Built-in:
› M/Query Language: lets you transform data in Power Query. "It supports custom functions as well as parameters."
› DAX (Data Analysis Expressions): lets you define custom calculated tables, columns, and measures in Power BI Desktop.
"Both languages are natively available in Power BI, which eliminates the need to install anything."
Additional:
› Python: lets you fetch data and create visuals. Requires installation of the Python language on your computer and enabling Python scripting.
› R: lets you fetch and transform data and create visuals. Requires installation of the R language on your computer and enabling R scripting.
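As a rough illustration of the Python option above: when Python scripting is enabled, a Power BI Python visual passes the selected fields to the script as a pandas DataFrame (conventionally named dataset) and renders whatever matplotlib figure the script produces. The sketch below is illustrative only; the month and revenue columns are hypothetical, and the stand-in DataFrame would be removed when the script runs inside Power BI.

# Sketch of a Power BI Python visual script (assumes Python scripting is enabled).
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the DataFrame Power BI would inject as `dataset`; remove inside Power BI.
dataset = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [120, 150, 170]})

# Aggregate the injected fields and draw a simple figure for the visual.
summary = dataset.groupby("month", sort=False)["revenue"].sum()

plt.figure(figsize=(6, 3))
plt.bar(list(summary.index), list(summary.values))
plt.title("Revenue by month")
plt.ylabel("Revenue")
plt.show()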
What is Power Query?
"An IDE for M development."

Components
› Ribbon: a ribbon containing settings and pre-built features, which Power Query itself rewrites into the M language for user convenience.
› Queries: a query is simply a named M expression. Queries can be moved into groups.
› Formula Bar: displays the currently loaded step and allows you to edit it. To see the formula bar, it has to be enabled in the ribbon menu inside the View category.
› Query settings: settings that include the ability to edit the name and description of the query. It also contains an overview of all currently applied steps. Applied Steps are the variables defined in a let expression, represented by their variable names.
› Data preview: a component that displays a preview of the data in the currently selected transformation step.
› Status bar: the bar located at the bottom of the screen. It contains information about the approximate number of rows and columns and the time the data was last reviewed. In addition, there is profiling information for the columns; here it is possible to switch the profiling from 1000 rows to the entire data set.

Value types
› Primitive: a single-part value, such as a number, logical, date, text, or null. A null value can be used to indicate the absence of any data.
› List: an ordered sequence of values. M supports unbounded (lazy) lists. The characters "{" and "}" indicate the beginning and the end of a list.
› Record: a set of fields, where a field is a name/value pair. The name is a text value that is unique within the record.
› Table: a set of values arranged in named columns and rows. A table can be operated on as if it were a list of records, or as if it were a record of lists: Table[Field] (field reference syntax) returns the list of values in that field, and Table{i} (item access syntax) returns the record representing a row of the table.
› Function: a value that, when called with arguments, creates a new value. Functions are written by listing the function arguments in parentheses, followed by the goes-to symbol "=>" and the expression defining the function. This expression usually refers to the arguments by name. There are also functions without arguments.
› Parameter: stores a value that can be used in transformations. In addition to the name of the parameter and the value it stores, it has other properties that provide metadata. The undeniable advantage of a parameter is that it can be changed from the Power BI Service environment without direct intervention in the dataset. The syntax of a parameter is that of a regular query; the only special thing is that its metadata follows a specific format.

Literals
Each value type is associated with a literal syntax, a set of values of that type, a set of operators defined over that set of values, and an intrinsic type attributed to newly created values:
› Null: null
› Logical: true, false
› Number: 1, 2, 3, ...
› Time: #time(HH,MM,SS)
› Date: #date(yyyy,mm,dd)
› DateTime: #datetime(yyyy,mm,dd,HH,MM,SS)
› DateTimeZone: #datetimezone(yyyy,mm,dd,HH,MM,SS,9,00)
› Duration: #duration(DD,HH,MM,SS)
› Text: "text"
› Binary: #binary("link")
› List: {1, 2, 3}
› Record: [ A = 1, B = 2 ]
› Table: #table({columns},{{first row content},{}…}) (the index of the first row of a table is 0, as for lists)
› Function: (x) => x + 1
› Type: type { number }, type table [ A = any, B = text ]

The let expression
The let expression is used to capture the value of an intermediate calculation in a named variable. These named variables are local in scope to the let expression. The construction of a let term looks like this:

let
    name_of_variable = <expression>,
    returnVariable = <function>(name_of_variable)
in
    returnVariable

When it is evaluated, the following always applies:
› The variable list defines a new scope: its identifiers are visible when evaluating the expressions within the list, and the expressions in the list can refer to each other.
› All variables that are needed must be evaluated before the let expression itself is evaluated.
› If expressions in variables are not available, the let expression will not be evaluated.
› Errors that occur during query evaluation propagate as errors to other linked queries.

Operators
There are several operators within the M language, but not every operator can be used with every type of value.
› Primary operators: (x) parenthesized expression; x[i] field reference (returns a value from a record, or a list of values from a table); x{i} item access (returns a value from a list, or a record from a table); x(…) function invocation; {1 .. 10} automatic list creation from 1 to 10; … not implemented. Placing the "?" character after a field or item access operator returns null if the index is not in the list. Operators can be combined, e.g. LastStep[Year]{[ID]}, which gets a value from another step based on an index.
› Mathematical operators: +, -, *, /
› Comparison operators: > and >= (greater than, greater than or equal to); < and <= (less than, less than or equal to); = and <> (is equal, is not equal; equal returns true even for null = null)
› Logical operators: and (short-circuiting conjunction), or (short-circuiting disjunction), not (logical negation)
› Type operators: as (is compatible nullable-primitive type, or error), is (tests whether compatible with a nullable-primitive type)
› Metadata: the word meta assigns metadata to a value, e.g. "x meta y" or "x meta [name = x, value = 123, …]"
Within Power Query, operator precedence applies, so for example "X + Y * Z" is evaluated as "X + (Y * Z)".

Conditions
Power Query has an "if" expression, which, based on the supplied condition, decides whether the result will be the true-expression or the false-expression. Syntactic form:

if <predicate> then <true-expression> else <false-expression>

"else is required in M's conditional expression."

Condition examples:
if x > 2 then 1 else 0
if [Month] > [Fiscal_Month] then true else false

The if expression is the only conditional in M. If you have multiple predicates to test, you must chain them together, like:

if <predicate>
then <true-expression>
else if <predicate>
     then <true-expression>
     else <false-expression>

When evaluating conditions, the following applies:
› If the value produced by evaluating the condition is not a logical value, an error with the reason code "Expression.Error" is raised.
› The true-expression is evaluated only if the condition evaluates to true; otherwise, the false-expression is evaluated.
› Branch expressions that are not needed must not be evaluated.
› An error that occurs while evaluating the condition propagates further, either as a failure of the entire query or as an "Error" value in the record.

Functions in Power Query
Knowledge of functions is your best helper when working with a functional language such as M. Functions are called with parentheses. Functions can be divided into two categories:
› Prefabricated: e.g. Date.From()
› Custom: functions the user prepares for the model by extending the notation with "() =>", where the arguments required for evaluating the function are placed in the parentheses. When using multiple arguments, they must be separated with a delimiter.
› Shared: a keyword that loads all functions (including help and examples) and enumerators into the result set. The call is made inside an empty query using = #shared.

Custom functions
Examples of custom function entries:

(x, y) => Number.From(x) + Number.From(y)

(x) =>
let
    out = Number.From(x) + Number.From(Date.From(DateTime.LocalNow()))
in
    out

The input arguments to functions are of two types:
› Required: all arguments written normally in (). Without these arguments, the function cannot be called.
› Optional: such a parameter may or may not be supplied. Mark a parameter as optional by placing the word "optional" before the argument name, for example (optional x). If an optional argument is not supplied, it still takes part in the calculation, but its value will be null. Optional arguments must come after required arguments.

Arguments can be annotated with "as <type>" to indicate the required type of the argument; the function will throw a type error if called with arguments of the wrong type. Functions can also have an annotated return type, written as:

(x as number, y as text) as logical => <expression>

The return value of a function can vary widely: the output can be a sheet, a table, a single value, but also another function. This means one function can produce another function. Such a function can be written as follows:

let first = (x) => () => let out = {1..x} in out in first

When evaluating functions, it holds that:
› Errors caused by evaluating expressions in the argument list or in the function expression propagate further, either as a failure or as an "Error" value.
› The number of arguments supplied must be compatible with the formal arguments of the function; otherwise an error with reason code "Expression.Error" occurs.

Recursive functions
Recursive functions need the "@" character, which refers to the function within its own calculation. A typical recursive function is the factorial, which can be written as follows:

let
    Factorial = (x) => if x = 0 then 1 else x * @Factorial(x - 1),
    Result = Factorial(3)
in
    Result // = 6

The expression try … otherwise
Capturing errors is possible using the try expression. An attempt is made to evaluate the expression after the word try; if an error occurs during the evaluation, the expression after the word otherwise is applied instead. Syntax example:

try Date.From([textDate]) otherwise null

Each
Functions can be called against specific arguments. However, if a function needs to be executed for each record, an entire sheet, or an entire column in a table, it is necessary to prepend the word each. As the name implies, the procedure behind it is applied for each context record. Each is never required! It simply makes it easier to define a function in-line for functions which require a function as their argument.

Syntax sugar
Each is essentially a syntactic abbreviation for declaring a non-type function with a single formal parameter named _. Therefore, the following notations are semantically equivalent:

let
    Source = ...,
    addColumn = Table.AddColumn(Source, "NewName", each [field1] + 1)
in
    addColumn

let
    Source = ...,
    add1ToField1 = (_) => [field1] + 1,
    addColumn = Table.AddColumn(Source, "NewName", add1ToField1)
in
    addColumn

The second piece of syntax sugar is that bare square brackets are syntax sugar for field access on a record named _.

Comments
The M language supports two kinds of comments:
› Single-line comments: created with // before the code. Shortcut: CTRL + ´
› Multi-line comments: created with /* before the code and */ after it. Shortcut: ALT + SHIFT + A

Keywords
and, as, each, else, error, false, if, in, is, let, meta, not, otherwise, or, section, shared, then, true, try, type, #binary, #date, #datetime, #datetimezone, #duration, #infinity, #nan, #sections, #shared, #table, #time

Query Folding
As the name implies, query folding is about composing: the steps in Power Query are composed into a single query, which is then executed against the data source. Data sources that support query folding are sources that support the concept of a query language, such as relational databases. This means that, for example, a CSV or XML flat file will definitely not be supported by query folding. With folding, the transformation does not have to wait until after the data is loaded; the data can arrive already prepared. Unfortunately, not every source supports this feature.
› Foldable operations: remove or rename columns; row filtering; grouping, summarizing, pivot and unpivot; merge and extract data from queries; connect queries based on the same data source; add custom columns with simple logic.
› Non-foldable operations: merge queries based on different data sources; adding columns with an index; changing the data type of a column.

DEMO
Production of a DateKey dimension goes like this (with start_date and end_date supplied elsewhere; the original count expression contained a typo and is corrected here to a day count):

#table(
    type table [Date=date, Day=Int64.Type, Month=Int64.Type, MonthName=text, Year=Int64.Type, Quarter=Int64.Type],
    List.Transform(
        List.Dates(start_date, Duration.Days(end_date - start_date), #duration(1, 0, 0, 0)),
        each {_, Date.Day(_), Date.Month(_), Date.MonthName(_), Date.Year(_), Date.QuarterOfYear(_)}
    )
)
7.3 Data Visualization

Learn Data Visualization online at www.DataCamp.com

Part-to-whole and categorical charts
• Pie chart: one of the most common ways to show part-to-whole data; also commonly used with percentages.
• Donut chart: a variant of the pie chart, the difference being it has a hole in the center for readability.
• Heatmap: two-dimensional charts that use color shading to represent data trends.
• Stacked bar chart: best to compare subcategories within categorical data; can also be used to compare percentages.
• Treemap: 2D rectangles whose size is proportional to the value being measured; can be used to display hierarchically structured data.

Capture a trend
• Line chart: the most straightforward way to capture how a numeric variable is changing over time. Use cases: revenue in $ over time; energy consumption in kWh over time; Google searches over time.
• Multi-line chart: captures multiple numeric variables over time. It can include multiple axes, allowing comparison of different units and scale ranges. Use cases: Apple vs Amazon stocks over time; Lebron vs Steph Curry searches over time; Bitcoin vs Ethereum price over time.
• Area chart: shows how a numeric value progresses by shading the area between the line and the x-axis. Use cases: total sales over time; active users over time.
• Stacked area chart: the most commonly used variation of area charts; best used to track the breakdown of a numeric value by subgroups. Use cases: active users over time by segment; total revenue over time by country.
• Spline chart: a smoothened version of a line chart. It differs in that data points are connected with smoothed curves to account for missing values, as opposed to straight lines. Use cases: electricity consumption over time; CO2 emissions over time.

Visualize a single value
• Card: great for showing and tracking KPIs in dashboards or presentations (e.g. "$7.47M Total Sales"). Use cases: revenue to date on a sales dashboard; total sign-ups after a promotion.
• Table chart: best used on small datasets; displays tabular data in a table. Use cases: account executive leaderboard; registrations per webinar.
• Gauge chart: often used in executive dashboard reports to show relevant KPIs. Use cases: NPS score; revenue to target.

Capture distributions
• Histogram: shows the distribution of a variable by converting numerical data into bins shown as columns; the x-axis shows the range and the y-axis the frequency. Use cases: distribution of salaries in an organization; distribution of height in one cohort.
• Box plot: shows the distribution of a variable using 5 key summary statistics: minimum, first quartile, median, third quartile, and maximum. Use cases: gas efficiency of vehicles; time spent reading across readers.
• Violin plot: a variation of the box plot that also shows the full distribution of the data alongside summary statistics. Use cases: time spent in restaurants across age groups; length of pill effects by dose.
• Density plot: visualizes a distribution by using smoothing, which allows smoother distributions and better captures the shape of the data. Use cases: distribution of prices of hotel listings; comparing NPS scores by customer segment.

Comparisons, relationships and flows
• Bar chart: one of the easiest charts to read, helping quick comparison of categorical data; one axis contains categories and the other represents values. Use cases: volume of Google searches by region; market share in revenue by product.
• Column chart: also known as a vertical bar chart, where the categories are placed on the x-axis. Preferred over bar charts for short labels, date ranges, or negative values. Use cases: brand market share; profit analysis by region.
• Scatter plot: the most commonly used chart when observing the relationship between two variables; especially useful for quickly surfacing potential correlations between data points. Use cases: relationship between time-on-platform and churn; relationship between salary and years spent at a company.
• Connected scatterplot: a hybrid between a scatter plot and a line plot; the scatter dots are connected with a line. Use cases: cryptocurrency price index; visualizing timelines and events when analyzing two variables.
• Bubble chart: often used to visualize data points with 3 dimensions, shown on the x-axis, the y-axis, and with the size of the bubble; it shows relations between data points using location and size. Use cases: Adwords analysis (CPC vs conversions vs share of total conversions); relationship between life expectancy, GDP per capita, and population size.
• Word cloud: a convenient visualization of the most prevalent words that appear in a text. Use case: top 100 words used by customers in service tickets.
• Sankey chart: useful for representing flows in systems, where the flow can be any measurable quantity; especially useful for highlighting the dominant or important flows. Use cases: energy flow between countries; supply chain volumes between warehouses.
• Chord chart: useful for presenting weighted relationships or flows between nodes. Use cases: exports between countries to showcase the biggest export partners; supply chain volumes between the largest warehouses.
• Network chart: similar to a graph, it consists of nodes and interconnected edges and illustrates how different items relate to each other. Use cases: how different airports are connected worldwide; social media friend group analysis.

Learn Data Skills Online at www.DataCamp.com
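To make the chart-type guidance above concrete, here is a minimal Python sketch (matplotlib) that draws three of the chart types described: a line chart for a trend, a histogram for a distribution, and a bar chart for a categorical comparison. The numbers are invented for illustration only.

import random
import matplotlib.pyplot as plt

random.seed(42)

months = list(range(1, 13))
revenue = [100 + 8 * m + random.uniform(-10, 10) for m in months]   # trend over time
salaries = [random.gauss(50_000, 8_000) for _ in range(500)]        # a distribution
regions = {"North": 42, "South": 31, "East": 27, "West": 35}        # categorical comparison

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(months, revenue)                      # line chart: numeric value over time
axes[0].set_title("Revenue over time")

axes[1].hist(salaries, bins=20)                    # histogram: distribution of a variable
axes[1].set_title("Salary distribution")

axes[2].bar(list(regions.keys()), list(regions.values()))  # bar chart: compare categories
axes[2].set_title("Searches by region")

plt.tight_layout()
plt.show()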
Visual vocabulary (Financial Times, ft.com/vocabulary)

Deviation
Emphasise variations (+/-) from a fixed reference point. Typically the reference point is zero, but it can also be a target or a long-term average. Can also be used to show sentiment (positive/neutral/negative). Example FT uses: trade surplus/deficit, climate change.
• Diverging bar: a simple standard bar chart that can handle both negative and positive magnitude values.
• Diverging stacked bar: perfect for presenting survey results which involve sentiment (e.g. disagree/neutral/agree).
• Spine chart: splits a single value into 2 contrasting components (e.g. Male/Female).
• Surplus/deficit filled line: the shaded area of these charts allows a balance to be shown, either against a baseline or between two series.

Correlation
Show the relationship between two or more variables. Be mindful that, unless you tell them otherwise, many readers will assume the relationships you show them to be causal (i.e. one causes the other). Example FT uses: inflation and unemployment, income and life expectancy.
• Scatterplot: the standard way to show the relationship between two continuous variables, each of which has its own axis.
• Line + column: a good way of showing the relationship between an amount (columns) and a rate (line).
• Connected scatterplot: usually used to show how the relationship between 2 variables has changed over time.
• Bubble: like a scatterplot, but adds additional detail by sizing the circles according to a third variable.
• XY heatmap: a good way of showing the patterns between 2 categories of data, less good at showing fine differences in amounts.

Ranking
Use where an item's position in an ordered list is more important than its absolute or relative value. Don't be afraid to highlight the points of interest. Example FT uses: wealth, deprivation, league tables, constituency election results.
• Ordered bar: standard bar charts display the ranks of values much more easily when sorted into order.
• Ordered column: see above.
• Ordered proportional symbol: use when there are big variations between values and/or seeing fine differences between data is not so important.
• Dot strip plot: dots placed in order on a strip are a space-efficient method of laying out ranks across multiple categories.
• Slope: perfect for showing how ranks have changed over time or vary between categories.
• Lollipop chart: lollipops draw more attention to the data value than standard bar/column charts and can also show rank and value effectively.

Distribution
Show values in a dataset and how often they occur. The shape (or 'skew') of a distribution can be a memorable way of highlighting the lack of uniformity or equality in the data. Example FT uses: income distribution, population (age/sex) distribution.
• Histogram: the standard way to show a statistical distribution; keep the gaps between columns small to highlight the 'shape' of the data.
• Boxplot: summarise multiple distributions by showing the median (centre) and range of the data.
• Violin plot: similar to a box plot but more effective with complex distributions (data that cannot be summarised with a simple average).
• Population pyramid: a standard way of showing the age and sex breakdown of a population distribution; effectively, back-to-back histograms.
• Dot strip plot: good for showing individual values in a distribution; can be a problem when too many dots have the same value.
• Dot plot: a simple way of showing the change or range (min/max) of data across multiple categories.
• Barcode plot: like dot strip plots, good for displaying all the data in a table; they work best when highlighting individual values.

Change over Time
Give emphasis to changing trends. These can be short (intra-day) movements or extended series traversing decades or centuries; choosing the correct time period is important to provide suitable context for the reader. Example FT uses: share price movements, economic time series.
• Line: the standard way to show a changing time series. If data are irregular, consider markers to represent data points.
• Column: columns work well for showing change over time, but usually best with only one series of data at a time.
• Line + column: a good way of showing the relationship over time between an amount (columns) and a rate (line).
• Stock price: usually focused on day-to-day activity, these charts show opening/closing and high/low points of each day.
• Slope: good for showing changing data as long as the data can be simplified into 2 or 3 points without missing a key part of the story.
• Area chart: use with care; these are good at showing changes to the total, but seeing change in the components can be very difficult.
• Fan chart (projections): use to show the uncertainty in future projections; usually this grows the further forward the projection.
• Calendar heatmap: a great way of showing temporal patterns (daily, weekly, monthly), at the expense of showing precision in quantity.
• Priestley timeline: great when date and duration are key elements of the story in the data.

Part-to-whole
Show how a single entity can be broken down into its component elements. If the reader's interest is solely in the size of the components, consider a magnitude-type chart instead. Example FT uses: fiscal budgets, company structures, national election results.
• Stacked column: a simple way of showing part-to-whole relationships, but can be difficult to read with more than a few components.
• Proportional stacked bar: a good way of showing the size and proportion of data at the same time, as long as the data are not too complicated.
• Pie: a common way of showing part-to-whole data, but be aware that it is difficult to accurately compare the size of the segments.
• Donut: similar to a pie chart, but the centre can be a good way of making space to include more information about the data (e.g. the total).
• Treemap: use for hierarchical part-to-whole relationships; can be difficult to read when there are many small segments.
• Voronoi: a way of turning points into areas; any point within each area is closer to its central point than to any other centroid.
• Sunburst: another way of visualising hierarchical part-to-whole relationships; use sparingly (if at all) for obvious reasons.
• Gridplot: good for showing percentage information; works best when used on whole numbers and in multiple layout form.
• Venn: generally only used for schematic representation.

Magnitude
Show size comparisons. These can be relative (just being able to see larger/bigger) or absolute (need to see fine differences). Usually these show a 'counted' number (for example, barrels, dollars or people) rather than a calculated rate or per cent. Example FT uses: commodity production, market capitalisation.
• Column: the standard way to compare the size of things; must always start at 0 on the axis.
• Bar: see above; good when the data are not a time series and labels have long category names.
• Paired column: as per a standard column chart but allows for multiple series; can become tricky to read with more than 2 series.
• Paired bar: see above.
• Proportional stacked bar: a good way of showing the size and proportion of data at the same time, as long as the data are not too complicated.
• Proportional symbol: use when there are big variations between values and/or seeing fine differences between data is not so important.
• Isotype (pictogram): an excellent solution in some instances; use only with whole numbers (do not slice off an arm to represent a decimal).
• Radar chart: a space-efficient way of showing the value of multiple variables, but make sure they are organised in a way that makes sense to the reader.
• Parallel coordinates: an alternative to radar charts; again, the arrangement of the variables is important. Usually benefits from highlighting values.

Spatial
Used only when precise locations or geographical patterns in data are more important to the reader than anything else. Example FT uses: locator maps, population density, natural resource locations, natural disaster risk/impact, catchment areas, variation in election results.
• Basic choropleth (rate/ratio): the standard approach for putting data on a map; should always be rates rather than totals, and use a sensible base geography.
• Proportional symbol (count/magnitude): use for totals rather than rates; be wary that small differences in data will be hard to see.
• Flow map: for showing unambiguous movement across a map.
• Contour map: for showing areas of equal value on a map; can use deviation colour schemes for showing +/- values.
• Equalised cartogram: converting each unit on a map to a regular and equally sized shape; good for representing voting regions with equal value.
• Scaled cartogram (value): stretching and shrinking a map so that each area is sized according to a particular value.
• Dot density: used to show the location of individual events/locations; make sure to annotate any patterns the reader should see.

Flow
Show the reader volumes or the intensity of movement between two or more states or conditions. These might be logical sequences or geographical locations. Example FT uses: movement of funds, trade, migrants, lawsuits, information; relationship graphs.
• Sankey: shows changes in flows from one condition to at least one other; good for tracing the eventual outcome of a complex process.
• Waterfall: designed to show the sequencing of data through a flow process, typically budgets; can include +/- components.
• Chord: a complex but powerful diagram which can illustrate 2-way flows (and the net winner) in a matrix.
• Network: used for showing the strength and inter-connectedness of relationships of varying types.

ft.com/vocabulary
8. Emerging Technologies
Top 10 Technology Trends

• Artificial Intelligence (AI): simulation of human intelligence processes by machines, which includes machine, deep, and reinforcement learning, natural language processing and computer vision.
• Internet of Things (IoT): networking capability that allows information to be sent to and received from objects and devices (things).
• Cloud Computing: delivery of different services through the Internet, including data storage, servers, databases, networking, and software.
• Quantum Computing: machines that use the properties of quantum physics to store data and perform computations, which can result in a computer 158 million times faster than the most sophisticated supercomputer we have in the world today.
• 3D Printing: also known as additive manufacturing, a method of creating a three-dimensional object layer by layer using a computer-created design.
• 5th Generation Network (5G): the fifth generation of wireless technology, which can provide higher speed, lower latency and greater capacity than 4G LTE networks.
• Blockchain: technology that facilitates the process of recording transactions and tracking assets in a business network, where an asset can be tangible (house, car, land) or intangible (patents, copyrights, branding).
• Cyber Security: application of technologies, processes and controls to protect systems, networks, programs, devices and data from cyber attacks.
8.1 Artificial Intelligence (AI)
Machine Learning Algorithm Cheat Sheet
This cheat sheet helps you choose the best machine learning algorithm for your predictive analytics solution.
Your decision is driven by both the nature of your data and the goal you want to achieve with your data.
Text Analytics
• Extract N-Gram Features from Text: creates a dictionary of n-grams from a column of free text
• Feature Hashing: converts text data to integer-encoded features using the Vowpal Wabbit library
• Preprocess Text: performs cleaning operations on text, like removal of stop-words and case normalization
• Word2Vector: converts words to values for use in NLP tasks, like recommenders, named entity recognition, and machine translation

Recommenders (generate recommendations): predict what someone will be interested in. Answers the question: What will they be interested in?
• Train Wide & Deep Recommender: hybrid recommender, both collaborative filtering and content-based approach
• SVD Recommender: collaborative filtering, better performance with lower cost by reducing dimensionality

Regression (predict values): makes forecasts by estimating the relationship between values. Answers questions like: How much or how many?
• Fast Forest Quantile Regression: predicts a distribution
• Poisson Regression: predicts event counts
• Linear Regression: fast training, linear model
• Bayesian Linear Regression: linear model, small data sets
• Decision Forest Regression: accurate, fast training times
• Neural Network Regression: accurate, long training times
• Boosted Decision Tree Regression: accurate, fast training times, large memory footprint

Clustering (discover structure): separates similar data points into intuitive groups. Answers questions like: How is this organized?
• K-Means: unsupervised learning

Two-Class Classification (predict between two categories): answers simple two-choice questions, like yes or no, true or false. Answers questions like: Is this A or B?
• Two-Class Support Vector Machine: under 100 features, linear model
• Two-Class Averaged Perceptron: fast training, linear model
• Two-Class Decision Forest: accurate, fast training
• Two-Class Logistic Regression: fast training, linear model
• Two-Class Boosted Decision Tree: accurate, fast training, large memory footprint
• Two-Class Neural Network: accurate, long training times

Multiclass Classification
• Multiclass Decision Forest: accuracy, fast training times
• One-vs-All Multiclass: depends on the two-class classifier
• One-vs-One Multiclass: depends on the binary classifier; less sensitive to an imbalanced dataset, with larger complexity
• Multiclass Boosted Decision Tree: non-parametric, fast training times and scalable

Anomaly Detection (find unusual occurrences): identifies and predicts rare or unusual data points. Answers the question: Is this weird?
• One Class SVM: under 100 features, aggressive boundary
• PCA-Based Anomaly Detection: fast training times

Image Classification (classify images): classifies images with popular networks. Answers questions like: What does this image represent?
• ResNet, DenseNet: modern deep learning neural networks
© 2021 Microsoft Corporation. All rights reserved. Share this poster: aka.ms/mlcheatsheet
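The cheat sheet above maps goals to Azure Machine Learning designer modules, but the same decision logic applies in code. As a hedged illustration (using scikit-learn rather than Azure ML, so the module names differ), the sketch below trains a two-class logistic regression, the "fast training, linear model" option for an "Is this A or B?" question, on a synthetic dataset.

# Illustrative only: a scikit-learn stand-in for the cheat sheet's two-class modules.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "Is this A or B?" problem with 20 features.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1_000)   # two-class logistic regression: fast, linear
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))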
8.2 Internet of Things (IoT)
Internet of Things cheat sheet
The Internet of Things (IoT) refers to a system of interrelated, internet-connected objects that can collect and transfer data over a wireless network without human intervention. IoT devices are basically "smart" devices that make up "smart" systems (e.g., smart home, smart factory, smart farms, smart cities, etc.).
8.3 Robotic Process Automation (RPA)

RPA Basics
RPA (Robotic Process Automation) is used for automating current workflows with the help of robots, reducing human intervention at every point.
• Robotic: machines that mimic human activities and actions are called robots
• Process: a sequence of steps used to perform a particular task
• Automation: any process done by a robot without human intervention, providing a high degree of accuracy
Typical RPA characteristics: no physical bot; processes documents and performs every step consistently; no changes to the existing infrastructure are needed; it uses the existing applications; time-to-market within a few weeks.

How to choose an RPA tool
• Platform independence: applications may run outside the local system using Virtual Machines or Citrix, so the tool must be platform independent and support any type of application
• Scalability: how quickly and easily the tool responds to business requirements
• Security: the implementation of security controls must be assessed
• Total cost: the initial set-up cost, vendor license fee and maintenance cost must be taken into consideration when selecting the tool
• Ease of use and control: the tool must be user friendly, to increase efficiency and employee satisfaction
• Vendor experience: improves the speed of implementation by reducing the work required to implement the RPA software
• Maintenance and support: to make sure that the required service level agreements are met
• Quick deployment: the tool must be able to work as a real end-user by interacting with applications

Blue Prism
Blue Prism's components include Object Studio and Process Studio (where Visual Business Objects (VBOs) are built from a user interface plus business logic), the Control Room, and the Blue Prism database.

Automation Anywhere
The architecture consists of clients, bots and a control room. Automation Anywhere is software designed to virtually automate any computer process; SQL Express is the tool where data is stored. Bots are built with the Bot Creator (a desktop application) and executed by Bot Runners, while developers and bots are managed centrally through the Control Room (user management, source control, dashboard, license management).

UiPath
UiPath is a simple software automation and application integration expert. UiPath consists of three parts:
• UiPath Studio
• UiPath Robots
• UiPath Orchestrator
The Orchestrator is a web application with an information tier, a logic tier and a data tier, accessed from browsers and mobile devices and integrated with the company directory server.

Vendor comparison (Free version | Pricing | Usability | Coverage)
• Another Monday: ---- | ---- | ---- | Europe
• AntWorks: ---- | ---- | ---- | ----
• Arago: ---- | ---- | ---- | ----
• Automation Anywhere: ---- | Per process | Drag & Drop, Macro Recording | Global
• BluePrism: ---- | Per bot | Drag & Drop | ----
• Contextor: ---- | ---- | ---- | EMEA & North America
• Jidoka: ---- | ---- | ---- | ----
• Kofax: ---- | ---- | ---- | Global
• Kryon Systems: ---- | ---- | ---- | ----
• Nice Systems: ---- | ---- | ---- | Global
• Pega: ---- | ---- | ---- | Global
• Redwood Software: ---- | ---- | ---- | ----
• UiPath: UiPath Community Edition | Per bot | Drag & Drop, Macro Recording | Global
• Visual Cron: 45-day free trial | Per server | Drag & Drop | ----
• WorkFusion: WorkFusion RPA Express | Per process | Drag & Drop, Macro Recording | Global

Benefits of RPA
• Reduces burden on IT: it does not disturb underlying legacy systems
• Reliability: bots can work 24*7 effectively
• Cost-cutting technology: it reduces costs by reducing the size of the manual workforce
• No coding required: to use RPA tools, a person need not have programming skills
• Accuracy: it functions with accuracy and is less prone to errors
• Productivity rate: execution time is much faster than the manual process approach
• Compliance: it follows the rules and provides an audit trail
• Consistency: repetitive tasks are performed in the same way
• Increased employee engagement: it lets employees focus on value-added activities

Using RPA
• Customer service: to automate service order management and quality reporting
• Travel and logistics: ticket booking, passenger details and accounting
• Human resources: new-employee joining formalities, payroll processing and hiring shortlisted candidates
• Health care: patient registration and billing
• Banking and financial services: card activation, fraud claims and discovery
• Government: change of address and license renewal

FURTHERMORE: RPA Training using UiPath
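Commercial RPA tools like the ones listed above automate work through recorded UI actions rather than hand-written code, but the underlying idea (a bot applying the same rules to every record) can be sketched in a few lines of Python. The example below is purely illustrative and is not how any of the listed vendors implement their bots; the approval threshold and the input file name are invented.

import csv
from pathlib import Path

APPROVAL_LIMIT = 500.0  # invented business rule: larger claims need manual review

def process_claims(input_csv: Path, output_csv: Path) -> None:
    """Apply the same decision rule to every record, like an unattended bot would."""
    with input_csv.open(newline="") as src, output_csv.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["decision"])
        writer.writeheader()
        for row in reader:
            amount = float(row["amount"])
            row["decision"] = "auto-approved" if amount <= APPROVAL_LIMIT else "manual review"
            writer.writerow(row)

if __name__ == "__main__":
    # Expects a claims.csv with at least an "amount" column (hypothetical input file).
    process_claims(Path("claims.csv"), Path("claims_processed.csv"))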
8.4 Cloud Computing
Cloud Computing – Summary Cheat Sheet (www.networkwalks.com)
"Cloud Computing is an approach to offering IT services to customers remotely over the Internet."
• PaaS is suitable for developers (e.g. AWS, Google App Engine); other service models target network architects and IT admins.
• Further "as-a-service" models mentioned: STaaS, SECaaS, DaaS, TEaaS, APIaaS, FaaS.
• Deployment can be via an internal (private) cloud or an external (public) cloud.
8.6 3D Printing
Introduction to 3D Printing
Source: https://www.medicaldesignandoutsourcing.com/3d-printing-options-medical-device-development/
• Fused Deposition Modeling (FDM)
• Direct metal laser sintering (DMLS): uses a fiber laser system that draws onto a surface of atomized metal powder, welding the powder into fully dense metal parts. DMLS builds fully functional metal prototypes and production parts and works well to reduce the number of metal components in multi-part assemblies.
• Multi Jet Fusion: selectively applies fusing and detailing agents across a bed of nylon powder, which are fused in thousands of layers by heating elements into a solid functional component. Final parts exhibit improved surface roughness, fine feature resolution, and more isotropic mechanical properties when compared to processes like SLS.
• PolyJet

8.7 5th Generation Network (5G)
• Advantages of 5G network
8.8 Extended Reality (XR)
Immersive Tech
Immersive technologies create experiences by merging the physical world with a digital / simulated
reality. Augmented reality (AR) and virtual reality (VR) are the two main types of immersive tech. AR
blends computer-generated information onto the user’s real environment. VR uses computer-
generated information to provide a full sense of immersion.
8.9 Blockchain
Case study: Food and Beverage Traceability – showcasing the truth of your product.
Challenge: how to create a safer and more transparent food and beverage supply chain?
Solution: blockchain technology was used, and a QR code was included on food and beverage products to allow the customer to see exactly how the product was made.
Results: using blockchain enhanced the transparency of product information and reduced food fraud and waste. It increased food safety and profitability among food suppliers, and it improved consumer information, boosting trust and sales. See how a County Down enterprise uses blockchain to reveal everything that a consumer would want to know about their beer!
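The traceability case above relies on the core property of a blockchain: each recorded transaction is chained to the previous one by a cryptographic hash, so past records cannot be quietly altered. The Python sketch below is a toy hash chain that illustrates this tamper-evidence; it is not how a production blockchain platform is implemented, and the batch data is invented.

import hashlib
import json

def block_hash(contents: dict) -> str:
    """Hash a block's contents, which include the hash of the previous block."""
    return hashlib.sha256(json.dumps(contents, sort_keys=True).encode()).hexdigest()

def add_block(chain: list, data: dict) -> None:
    previous = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev_hash": previous}
    block["hash"] = block_hash(block)
    chain.append(block)

chain: list = []
add_block(chain, {"batch": "beer-0042", "step": "brewed", "site": "County Down"})
add_block(chain, {"batch": "beer-0042", "step": "bottled"})
add_block(chain, {"batch": "beer-0042", "step": "shipped to retailer"})

# Tampering with an earlier record is detectable: its stored hash no longer matches a
# recomputation, and "fixing" that hash would break every later prev_hash link.
chain[0]["data"]["step"] = "relabelled"
for block in chain:
    contents = {k: v for k, v in block.items() if k != "hash"}
    print(block["index"], "record intact?", block_hash(contents) == block["hash"])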
8.10 Cyber Security
Cyber Security Quick Reference Guide (ref.customguide.com)
Businesses worldwide are at risk of security breaches. While large, well-known companies seem like a likely target, small and medium-sized organizations and individuals are also at risk. There are many ways data can be compromised, including viruses, phishing scams, hardware and software vulnerabilities, and network security holes.

Passwords
The first line of defense in maintaining system security is using complex passwords. Use passwords that are at least 8 characters long and include a combination of numbers, upper and lowercase letters, and special characters. Did you know? Hackers have tools that can break easy passwords in just a few minutes.
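The password advice above can be turned into a simple automated check. The sketch below is an illustrative Python helper (not part of the original guide) that tests whether a candidate password meets the length-and-character-mix guidance; a real system would combine this with checks against known breached passwords.

import string

def is_complex_password(password: str) -> bool:
    """Apply the guidance above: 8+ characters mixing digits, upper, lower and specials."""
    return (
        len(password) >= 8
        and any(c.isdigit() for c in password)
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c in string.punctuation for c in password)
    )

print(is_complex_password("password"))    # False: too simple
print(is_complex_password("C0ffee!Mug"))  # True: meets all four character classes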
Malware
Malware is short for "malicious software." It is written to infect the host computer. Common types of malware include:
• Virus: a replicating computer program that infects computers
• Adware: hijacks your computer or browser and displays annoying advertisements
• Spyware: secretly tracks your internet activities and information
• Trojan: a malicious program that tries to trick you into running it
HTTP vs. HTTPS
Browsers communicate with websites using a protocol called HTTP, which stands for Hyper Text Transfer Protocol. HTTPS is the secure version of HTTP: websites that use HTTPS encrypt all communication between your browser and the site. Secure sites have an indicator, like a padlock, in the address bar to show the site is secure. You should always ensure security when logging in or transferring confidential information. Sites without HTTPS are not secure and should never be used when dealing with personal data; if you are simply reading an article or checking the weather, HTTP is acceptable.

Network and device safety
• Use Wi-Fi password security and change the default password
• Set permissions for shared files
• Only connect to known, secure public Wi-Fi and ensure HTTPS-enabled sites are used for sensitive data
• Keep your operating system updated
• Perform regular security checks
• Browse smart!
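That padlock indicator reflects a verified TLS certificate. As a hedged illustration, the Python sketch below uses only the standard library to open a TLS connection to a site and print the certificate details a browser checks before showing the padlock; www.example.com is just a placeholder host.

import socket
import ssl

def fetch_certificate(hostname: str, port: int = 443) -> dict:
    """Connect over TLS (verifying the certificate chain and hostname) and
    return the server certificate that the padlock indicator is based on."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

if __name__ == "__main__":
    cert = fetch_certificate("www.example.com")  # placeholder host
    print("Issued to:", dict(item[0] for item in cert["subject"]))
    print("Issued by:", dict(item[0] for item in cert["issuer"]))
    print("Valid until:", cert["notAfter"])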
Phishing
A phishing email tries to trick consumers into providing confidential data in order to steal money or information. These emails appear to be from a credible source, such as a bank, government entity, or service provider. Here are some things to look for in a phishing email:
• Sender's address: the address should be correlated with the sender
• Grammatical errors: spelling mistakes and poor grammar