Executive Summary of AI and ET


Executive Summary of Artificial Intelligence and Emerging Technologies

Prepared by:
Dr. Mejdal Alqahtani
Table of contents

1. Overview on AI
2. Basic math and probabilities
3. Machine learning
3.1. Supervised learning
3.2. Unsupervised learning
4. Deep learning
5. Reinforcement learning
6. Programming languages
6.1. Python language
6.2. R language
6.3. SQL language
7. Business Intelligence
7.1. Tableau
7.2. Power BI
7.3. Data Visualization
8. Emerging Technologies
8.1. Artificial Intelligence (AI)
8.2. Internet of Things (IoT)
8.3. Robotic Process Automation (RPA)
8.4. Cloud Computing
8.5. Quantum Computing
8.6. 3D Printing
8.7. 5th Generation Network (5G)
8.8. Extended Reality (XR)
8.9. Blockchain
8.10. Cyber Security
1. Overview on AI
2. Basic math and probabilities
CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Refresher: Linear Algebra and Calculus
Afshine Amidi and Shervine Amidi – October 6, 2018

General notations

Vector – We note $x \in \mathbb{R}^n$ a vector with $n$ entries, where $x_i \in \mathbb{R}$ is the $i$th entry: $x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$.

Matrix – We note $A \in \mathbb{R}^{m\times n}$ a matrix with $m$ rows and $n$ columns, where $A_{i,j} \in \mathbb{R}$ is the entry located in the $i$th row and $j$th column.
Remark: the vector $x$ defined above can be viewed as an $n \times 1$ matrix and is more particularly called a column vector.

Identity matrix – The identity matrix $I \in \mathbb{R}^{n\times n}$ is a square matrix with ones on its diagonal and zeros everywhere else.
Remark: for all matrices $A \in \mathbb{R}^{n\times n}$, we have $A \times I = I \times A = A$.

Diagonal matrix – A diagonal matrix $D \in \mathbb{R}^{n\times n}$ is a square matrix with nonzero values on its diagonal and zeros everywhere else. We also note $D$ as $\mathrm{diag}(d_1, \ldots, d_n)$.

Matrix operations

Vector-vector multiplication – There are two types of vector-vector products:
• inner product: for $x, y \in \mathbb{R}^n$, we have $x^T y = \sum_{i=1}^n x_i y_i \in \mathbb{R}$
• outer product: for $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, we have $xy^T \in \mathbb{R}^{m\times n}$ with $(xy^T)_{i,j} = x_i y_j$

Matrix-vector multiplication – The product of matrix $A \in \mathbb{R}^{m\times n}$ and vector $x \in \mathbb{R}^n$ is a vector of size $\mathbb{R}^m$, such that $Ax = \sum_{i=1}^n a_{c,i} x_i \in \mathbb{R}^m$, where $a_{r,i}^T$ are the rows and $a_{c,j}$ are the columns of $A$, and $x_i$ are the entries of $x$.

Matrix-matrix multiplication – The product of matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$ is a matrix of size $\mathbb{R}^{m\times p}$, such that $AB = \sum_{i=1}^n a_{c,i} b_{r,i}^T \in \mathbb{R}^{m\times p}$, where $a_{r,i}^T$, $b_{r,i}^T$ are the rows and $a_{c,j}$, $b_{c,j}$ are the columns of $A$ and $B$ respectively.

Transpose – The transpose of a matrix $A \in \mathbb{R}^{m\times n}$, noted $A^T$, is such that its entries are flipped: $\forall i,j,\ (A^T)_{i,j} = A_{j,i}$.
Remark: for matrices $A, B$, we have $(AB)^T = B^T A^T$.

Inverse – The inverse of an invertible square matrix $A$ is noted $A^{-1}$ and is the only matrix such that $AA^{-1} = A^{-1}A = I$.
Remark: not all square matrices are invertible. Also, for matrices $A, B$, we have $(AB)^{-1} = B^{-1}A^{-1}$.

Trace – The trace of a square matrix $A$, noted $\mathrm{tr}(A)$, is the sum of its diagonal entries: $\mathrm{tr}(A) = \sum_{i=1}^n A_{i,i}$.
Remark: for matrices $A, B$, we have $\mathrm{tr}(A^T) = \mathrm{tr}(A)$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.

Determinant – The determinant of a square matrix $A \in \mathbb{R}^{n\times n}$, noted $|A|$ or $\det(A)$, is expressed recursively in terms of $A_{\setminus i,\setminus j}$, the matrix $A$ without its $i$th row and $j$th column:
$$\det(A) = |A| = \sum_{j=1}^n (-1)^{i+j} A_{i,j}\,|A_{\setminus i,\setminus j}|$$
Remark: $A$ is invertible if and only if $|A| \neq 0$. Also, $|AB| = |A||B|$ and $|A^T| = |A|$.

Matrix properties

Symmetric decomposition – A given matrix $A$ can be expressed in terms of its symmetric and antisymmetric parts: $A = \frac{A + A^T}{2} + \frac{A - A^T}{2}$.

Norm – A norm is a function $N: V \to [0, +\infty[$, where $V$ is a vector space, such that for all $x, y \in V$: $N(x+y) \leq N(x) + N(y)$; $N(ax) = |a|N(x)$ for a scalar $a$; and $N(x) = 0$ implies $x = 0$. The most commonly used norms are:
• Manhattan, $L^1$: $\|x\|_1 = \sum_{i=1}^n |x_i|$ (used in LASSO regularization)
• Euclidean, $L^2$: $\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$ (used in ridge regularization)
• $p$-norm, $L^p$: $\|x\|_p = \left(\sum_{i=1}^n x_i^p\right)^{1/p}$ (used in the Hölder inequality)
• Infinity, $L^\infty$: $\|x\|_\infty = \max_i |x_i|$ (used in uniform convergence)

Linear dependence – A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.
Remark: if no vector can be written this way, the vectors are said to be linearly independent.

Matrix rank – The rank of a matrix $A$, noted $\mathrm{rank}(A)$, is the dimension of the vector space generated by its columns, which is equivalent to the maximum number of linearly independent columns of $A$.

Positive semi-definite matrix – A matrix $A \in \mathbb{R}^{n\times n}$ is positive semi-definite (PSD), noted $A \succeq 0$, if $A = A^T$ and $\forall x \in \mathbb{R}^n,\ x^T A x \geq 0$.
Remark: similarly, a matrix $A$ is said to be positive definite, noted $A \succ 0$, if it is a PSD matrix that satisfies $x^T A x > 0$ for all non-zero vectors $x$.

Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n\times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z \in \mathbb{R}^n \setminus \{0\}$, called an eigenvector, such that $Az = \lambda z$.

Spectral theorem – Let $A \in \mathbb{R}^{n\times n}$. If $A$ is symmetric, then $A$ is diagonalizable by a real orthogonal matrix $U \in \mathbb{R}^{n\times n}$: with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, we have $A = U\Lambda U^T$.

Singular-value decomposition – For a given matrix $A$ of dimensions $m \times n$, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of $U$ ($m \times m$, unitary), $\Sigma$ ($m \times n$, diagonal) and $V$ ($n \times n$, unitary) such that $A = U\Sigma V^T$.

Matrix calculus

Gradient – Let $f: \mathbb{R}^{m\times n} \to \mathbb{R}$ be a function and $A \in \mathbb{R}^{m\times n}$ a matrix. The gradient of $f$ with respect to $A$ is the $m \times n$ matrix $\nabla_A f(A)$ with $(\nabla_A f(A))_{i,j} = \frac{\partial f(A)}{\partial A_{i,j}}$.
Remark: the gradient of $f$ is only defined when $f$ returns a scalar.

Hessian – Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function and $x \in \mathbb{R}^n$ a vector. The Hessian of $f$ with respect to $x$ is the $n \times n$ symmetric matrix $\nabla_x^2 f(x)$ with $(\nabla_x^2 f(x))_{i,j} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$.
Remark: the Hessian of $f$ is only defined when $f$ returns a scalar.

Gradient operations – For matrices $A, B, C$, the following gradient properties are worth having in mind:
$$\nabla_A \mathrm{tr}(AB) = B^T \qquad \nabla_{A^T} f(A) = (\nabla_A f(A))^T \qquad \nabla_A \mathrm{tr}(ABA^TC) = CAB + C^TAB^T \qquad \nabla_A |A| = |A|(A^{-1})^T$$
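
To make these operations concrete, here is a minimal sketch of the main matrix operations above using NumPy (NumPy assumed to be installed; the arrays are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])  # symmetric, so the spectral theorem applies

inner = x @ y                    # inner product x^T y, a scalar
outer = np.outer(x, y)           # outer product x y^T, a 3x3 matrix
Ax = A @ x                       # matrix-vector product
trace = np.trace(A)              # sum of diagonal entries
det = np.linalg.det(A)           # determinant
A_inv = np.linalg.inv(A)         # inverse (A must be non-singular)

# Spectral theorem: a symmetric matrix is diagonalizable by an orthogonal matrix
eigvals, U = np.linalg.eigh(A)   # A = U diag(eigvals) U^T
reconstructed = U @ np.diag(eigvals) @ U.T
print(np.allclose(A, reconstructed))  # True
```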


CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Refresher: Probabilities and Statistics
Afshine Amidi and Shervine Amidi – August 6, 2018

Introduction to Probability and Combinatorics

Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by $S$.

Event – Any subset $E$ of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in $E$, then we say that $E$ has occurred.

Axioms of probability – For each event $E$, we denote $P(E)$ the probability of event $E$ occurring. With $E_1, \ldots, E_n$ mutually exclusive events, we have the 3 following axioms:
(1) $0 \leq P(E) \leq 1$   (2) $P(S) = 1$   (3) $P\left(\bigcup_{i=1}^n E_i\right) = \sum_{i=1}^n P(E_i)$

Permutation – A permutation is an arrangement of $r$ objects from a pool of $n$ objects, in a given order. The number of such arrangements is $P(n,r) = \frac{n!}{(n-r)!}$.

Combination – A combination is an arrangement of $r$ objects from a pool of $n$ objects, where the order does not matter. The number of such arrangements is $C(n,r) = \frac{P(n,r)}{r!} = \frac{n!}{r!(n-r)!}$.
Remark: for $0 \leq r \leq n$, we have $P(n,r) \geq C(n,r)$.

Conditional Probability

Bayes' rule – For events $A$ and $B$ such that $P(B) > 0$, we have $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
Remark: $P(A \cap B) = P(A)P(B|A) = P(A|B)P(B)$.

Partition – Let $\{A_i, i \in [[1,n]]\}$ be such that for all $i$, $A_i \neq \emptyset$. We say that $\{A_i\}$ is a partition if $\forall i \neq j,\ A_i \cap A_j = \emptyset$ and $\bigcup_{i=1}^n A_i = S$.
Remark: for any event $B$ in the sample space, $P(B) = \sum_{i=1}^n P(B|A_i)P(A_i)$.

Extended form of Bayes' rule – Let $\{A_i, i \in [[1,n]]\}$ be a partition of the sample space. We have:
$$P(A_k|B) = \frac{P(B|A_k)P(A_k)}{\sum_{i=1}^n P(B|A_i)P(A_i)}$$

Independence – Two events $A$ and $B$ are independent if and only if $P(A \cap B) = P(A)P(B)$.

Random Variables

Random variable – A random variable, often noted $X$, is a function that maps every element in a sample space to a real line.

Cumulative distribution function (CDF) – The cumulative distribution function $F$, which is monotonically non-decreasing and such that $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to+\infty} F(x) = 1$, is defined as $F(x) = P(X \leq x)$.
Remark: $P(a < X \leq b) = F(b) - F(a)$.

Probability density function (PDF) – The probability density function $f$ is the probability that $X$ takes on values between two adjacent realizations of the random variable.

Relationships involving the PDF and CDF – In the discrete (D) and continuous (C) cases:
• (D): $F(x) = \sum_{x_i \leq x} P(X = x_i)$ and $f(x_j) = P(X = x_j)$, with $0 \leq f(x_j) \leq 1$ and $\sum_j f(x_j) = 1$
• (C): $F(x) = \int_{-\infty}^x f(y)\,dy$ and $f(x) = \frac{dF}{dx}$, with $f(x) \geq 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

Expectation and moments of the distribution – The expected value $E[X]$, generalized expected value $E[g(X)]$, $k$th moment $E[X^k]$ and characteristic function $\psi(\omega)$ are:
• (D): $E[X] = \sum_i x_i f(x_i)$, $E[g(X)] = \sum_i g(x_i)f(x_i)$, $E[X^k] = \sum_i x_i^k f(x_i)$, $\psi(\omega) = \sum_i f(x_i)e^{i\omega x_i}$
• (C): $E[X] = \int_{-\infty}^{+\infty} xf(x)\,dx$, $E[g(X)] = \int_{-\infty}^{+\infty} g(x)f(x)\,dx$, $E[X^k] = \int_{-\infty}^{+\infty} x^kf(x)\,dx$, $\psi(\omega) = \int_{-\infty}^{+\infty} f(x)e^{i\omega x}\,dx$
Remark: $e^{i\omega x} = \cos(\omega x) + i\sin(\omega x)$.

Revisiting the $k$th moment – The $k$th moment can also be computed with the characteristic function: $E[X^k] = \frac{1}{i^k}\left[\frac{\partial^k\psi}{\partial\omega^k}\right]_{\omega=0}$.

Transformation of random variables – Let the variables $X$ and $Y$ be linked by some function. By noting $f_X$ and $f_Y$ the distribution functions of $X$ and $Y$ respectively, we have $f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$.

Leibniz integral rule – Let $g$ be a function of $x$ and potentially $c$, and $a, b$ boundaries that may depend on $c$. We have:
$$\frac{\partial}{\partial c}\int_a^b g(x)\,dx = \frac{\partial b}{\partial c}\cdot g(b) - \frac{\partial a}{\partial c}\cdot g(a) + \int_a^b \frac{\partial g}{\partial c}(x)\,dx$$

Chebyshev's inequality – Let $X$ be a random variable with expected value $\mu$ and standard deviation $\sigma$. For $k, \sigma > 0$, we have $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$.

Variance – The variance of a random variable, often noted $\mathrm{Var}(X)$ or $\sigma^2$, is a measure of the spread of its distribution function: $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$.

Standard deviation – The standard deviation of a random variable, often noted $\sigma$, is a measure of the spread of its distribution function that is compatible with the units of the actual random variable: $\sigma = \sqrt{\mathrm{Var}(X)}$.

Jointly Distributed Random Variables

Conditional density – The conditional density of $X$ with respect to $Y$, often noted $f_{X|Y}$, is defined as $f_{X|Y}(x) = \frac{f_{XY}(x,y)}{f_Y(y)}$.

Independence – Two random variables $X$ and $Y$ are said to be independent if $f_{XY}(x,y) = f_X(x)f_Y(y)$.

Marginal density and cumulative distribution – From the joint density function $f_{XY}$, we have:
• (D): $f_X(x_i) = \sum_j f_{XY}(x_i,y_j)$ and $F_{XY}(x,y) = \sum_{x_i\leq x}\sum_{y_j\leq y} f_{XY}(x_i,y_j)$
• (C): $f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x,y)\,dy$ and $F_{XY}(x,y) = \int_{-\infty}^x\int_{-\infty}^y f_{XY}(x',y')\,dx'\,dy'$

Distribution of a sum of independent random variables – Let $Y = X_1 + \ldots + X_n$ with $X_1, \ldots, X_n$ independent. We have $\psi_Y(\omega) = \prod_{k=1}^n \psi_{X_k}(\omega)$.

Covariance – We define the covariance of two random variables $X$ and $Y$, noted $\sigma_{XY}^2$ or more commonly $\mathrm{Cov}(X,Y)$, as $\mathrm{Cov}(X,Y) = \sigma_{XY}^2 = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y$.

Correlation – By noting $\sigma_X$, $\sigma_Y$ the standard deviations of $X$ and $Y$, we define the correlation between $X$ and $Y$, noted $\rho_{XY}$, as $\rho_{XY} = \frac{\sigma_{XY}^2}{\sigma_X\sigma_Y}$.
Remarks: for any $X, Y$, $\rho_{XY} \in [-1,1]$. If $X$ and $Y$ are independent, then $\rho_{XY} = 0$.

Main distributions – Here are the main distributions to have in mind:
• Binomial (D), $X \sim B(n,p)$, $x \in [[0,n]]$: $P(X=x) = \binom{n}{x}p^xq^{n-x}$, $\psi(\omega) = (pe^{i\omega}+q)^n$, $E[X] = np$, $\mathrm{Var}(X) = npq$
• Poisson (D), $X \sim \mathrm{Po}(\mu)$, $x \in \mathbb{N}$: $P(X=x) = \frac{\mu^x}{x!}e^{-\mu}$, $\psi(\omega) = e^{\mu(e^{i\omega}-1)}$, $E[X] = \mu$, $\mathrm{Var}(X) = \mu$
• Uniform (C), $X \sim U(a,b)$, $x \in [a,b]$: $f(x) = \frac{1}{b-a}$, $\psi(\omega) = \frac{e^{i\omega b} - e^{i\omega a}}{(b-a)i\omega}$, $E[X] = \frac{a+b}{2}$, $\mathrm{Var}(X) = \frac{(b-a)^2}{12}$
• Gaussian (C), $X \sim N(\mu,\sigma)$, $x \in \mathbb{R}$: $f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, $\psi(\omega) = e^{i\omega\mu - \frac{1}{2}\omega^2\sigma^2}$, $E[X] = \mu$, $\mathrm{Var}(X) = \sigma^2$
• Exponential (C), $X \sim \mathrm{Exp}(\lambda)$, $x \in \mathbb{R}_+$: $f(x) = \lambda e^{-\lambda x}$, $\psi(\omega) = \frac{1}{1 - \frac{i\omega}{\lambda}}$, $E[X] = \frac{1}{\lambda}$, $\mathrm{Var}(X) = \frac{1}{\lambda^2}$

Parameter estimation

Random sample – A random sample is a collection of $n$ random variables $X_1, \ldots, X_n$ that are independent and identically distributed with $X$.

Estimator – An estimator $\hat\theta$ is a function of the data that is used to infer the value of an unknown parameter $\theta$ in a statistical model.

Bias – The bias of an estimator $\hat\theta$ is defined as the difference between the expected value of the distribution of $\hat\theta$ and the true value, i.e. $\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta$.
Remark: an estimator is said to be unbiased when $E[\hat\theta] = \theta$.

Sample mean and variance – The sample mean and sample variance of a random sample are used to estimate the true mean $\mu$ and the true variance $\sigma^2$ of a distribution; they are noted $\overline{X}$ and $s^2$ respectively, with:
$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \qquad s^2 = \hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$$

Central Limit Theorem – For a random sample $X_1, \ldots, X_n$ following a given distribution with mean $\mu$ and variance $\sigma^2$, as $n \to +\infty$ we have $\overline{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
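
As a small sketch of the estimators and the Central Limit Theorem above, the following NumPy snippet computes the sample mean and unbiased sample variance of a random sample, and checks the spread of sample means (NumPy assumed; the exponential distribution and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random sample from an exponential distribution: mean 1/lambda = 2, variance 1/lambda^2 = 4
lam = 0.5
sample = rng.exponential(scale=1 / lam, size=1000)

x_bar = sample.mean()            # sample mean, estimates mu = 2
s2 = sample.var(ddof=1)          # unbiased sample variance (1/(n-1) normalization), estimates 4

# Central Limit Theorem: sample means of size n=100 are roughly N(mu, sigma/sqrt(n))
means = rng.exponential(scale=1 / lam, size=(5000, 100)).mean(axis=1)
print(x_bar, s2, means.std())    # means.std() is close to sigma/sqrt(n) = 2/10 = 0.2
```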


3. Machine learning
• History of AI

• Structure of Machine learning


• Applications of Machine learning

• Top Algorithms of machine and deep learning


3.1. Supervised learning
CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Cheatsheet: Supervised Learning
Afshine Amidi and Shervine Amidi – September 9, 2018

Introduction to Supervised Learning

Given a set of data points $\{x^{(1)}, \ldots, x^{(m)}\}$ associated to a set of outcomes $\{y^{(1)}, \ldots, y^{(m)}\}$, we want to build a classifier that learns how to predict $y$ from $x$.

Type of prediction – The different types of predictive models are summed up below:
• Regression: continuous outcome; example: linear regression.
• Classifier: class outcome; examples: logistic regression, SVM, Naive Bayes.

Type of model – The different models are summed up below:
• Discriminative model: directly estimates $P(y|x)$; what is learned is the decision boundary; examples: regressions, SVMs.
• Generative model: estimates $P(x|y)$ to deduce $P(y|x)$; what is learned are the probability distributions of the data; examples: GDA, Naive Bayes.

Notations and general concepts

Hypothesis – The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x^{(i)}$, the model prediction output is $h_\theta(x^{(i)})$.

Loss function – A loss function is a function $L: (z,y) \in \mathbb{R}\times Y \mapsto L(z,y) \in \mathbb{R}$ that takes as inputs the predicted value $z$ corresponding to the real data value $y$ and outputs how different they are. The common loss functions are summed up below:
• Least squared error: $\frac{1}{2}(y-z)^2$ (linear regression)
• Logistic loss: $\log(1 + \exp(-yz))$ (logistic regression)
• Hinge loss: $\max(0, 1 - yz)$ (SVM)
• Cross-entropy: $-\left[y\log(z) + (1-y)\log(1-z)\right]$ (neural network)

Cost function – The cost function $J$ is commonly used to assess the performance of a model, and is defined with the loss function $L$ as follows:
$$J(\theta) = \sum_{i=1}^m L(h_\theta(x^{(i)}), y^{(i)})$$

Gradient descent – By noting $\alpha \in \mathbb{R}$ the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function $J$ as $\theta \leftarrow \theta - \alpha\nabla J(\theta)$.
Remark: stochastic gradient descent (SGD) updates the parameters based on each training example, while batch gradient descent does so on a batch of training examples.

Likelihood – The likelihood of a model $L(\theta)$ given parameters $\theta$ is used to find the optimal parameters $\theta$ through likelihood maximization. In practice, we use the log-likelihood $\ell(\theta) = \log(L(\theta))$, which is easier to optimize. We have $\theta^{\mathrm{opt}} = \arg\max_\theta L(\theta)$.

Newton's algorithm – Newton's algorithm is a numerical method that finds $\theta$ such that $\ell'(\theta) = 0$. Its update rule is $\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$.
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the update rule $\theta \leftarrow \theta - \left(\nabla_\theta^2\ell(\theta)\right)^{-1}\nabla_\theta\ell(\theta)$.
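
As a concrete illustration of the cost function and gradient descent update above, here is a minimal sketch of batch gradient descent on the least-squared loss (NumPy assumed; the synthetic data, learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n - 1))]   # design matrix with an intercept column
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=m)

alpha = 0.05                       # learning rate
theta = np.zeros(n)
for _ in range(500):
    grad = X.T @ (X @ theta - y)   # gradient of J(theta) = 1/2 * sum (h(x) - y)^2
    theta -= alpha * grad / m      # batch gradient descent update (averaged for stability)

print(theta)                       # close to theta_true
```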


Linear regression

We assume here that $y|x;\theta \sim N(\mu, \sigma^2)$.

Normal equations – By noting $X$ the design matrix, the value of $\theta$ that minimizes the cost function has a closed-form solution: $\theta = (X^TX)^{-1}X^Ty$.

LMS algorithm – By noting $\alpha$ the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of $m$ data points, also known as the Widrow-Hoff learning rule, is:
$$\forall j,\quad \theta_j \leftarrow \theta_j + \alpha\sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$
Remark: the update rule is a particular case of gradient ascent.

LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by $w^{(i)}(x)$, defined with parameter $\tau \in \mathbb{R}$ as:
$$w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$

Classification and logistic regression

Sigmoid function – The sigmoid function $g$, also known as the logistic function, is defined as $\forall z \in \mathbb{R},\ g(z) = \frac{1}{1 + e^{-z}} \in\ ]0,1[$.

Logistic regression – We assume here that $y|x;\theta \sim \mathrm{Bernoulli}(\phi)$. We have the following form:
$$\phi = p(y = 1|x;\theta) = \frac{1}{1 + \exp(-\theta^Tx)} = g(\theta^Tx)$$
Remark: there is no closed-form solution in the case of logistic regression.

Softmax regression – A softmax regression, also called multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set $\theta_K = 0$, which makes the Bernoulli parameter $\phi_i$ of each class $i$ equal to:
$$\phi_i = \frac{\exp(\theta_i^Tx)}{\sum_{j=1}^K \exp(\theta_j^Tx)}$$
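
Below is a minimal sketch of logistic regression trained by gradient ascent on the log-likelihood, using the sigmoid defined above (NumPy assumed; the synthetic data and step size are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
m = 200
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]        # intercept + 2 features
theta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ theta_true)).astype(float)   # Bernoulli labels

alpha, theta = 0.1, np.zeros(3)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ theta))              # gradient of the log-likelihood
    theta += alpha * grad / m                          # gradient ascent step

print(theta)   # roughly recovers theta_true (no closed-form solution exists for logistic regression)
```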


Generalized Linear Models

Exponential family – A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, $\eta$, a sufficient statistic $T(y)$ and a log-partition function $a(\eta)$ as follows:
$$p(y;\eta) = b(y)\exp(\eta T(y) - a(\eta))$$
Remark: we will often have $T(y) = y$. Also, $\exp(-a(\eta))$ can be seen as a normalization parameter that makes sure the probabilities sum to one.

The most common exponential distributions are summed up below (distribution: $\eta$, $T(y)$, $a(\eta)$, $b(y)$):
• Bernoulli: $\eta = \log\frac{\phi}{1-\phi}$, $T(y) = y$, $a(\eta) = \log(1 + \exp(\eta))$, $b(y) = 1$
• Gaussian: $\eta = \mu$, $T(y) = y$, $a(\eta) = \frac{\eta^2}{2}$, $b(y) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$
• Poisson: $\eta = \log(\lambda)$, $T(y) = y$, $a(\eta) = e^\eta$, $b(y) = \frac{1}{y!}$
• Geometric: $\eta = \log(1-\phi)$, $T(y) = y$, $a(\eta) = \log\left(\frac{e^\eta}{1-e^\eta}\right)$, $b(y) = 1$

Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random variable $y$ as a function of $x \in \mathbb{R}^{n+1}$ and rely on the following 3 assumptions:
(1) $y|x;\theta \sim \mathrm{ExpFamily}(\eta)$   (2) $h_\theta(x) = E[y|x;\theta]$   (3) $\eta = \theta^Tx$
Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

Optimal margin classifier – The optimal margin classifier $h$ is such that $h(x) = \mathrm{sign}(w^Tx - b)$, where $(w, b) \in \mathbb{R}^n\times\mathbb{R}$ is the solution of the following optimization problem:
$$\min \frac{1}{2}\|w\|^2 \quad\text{such that}\quad y^{(i)}(w^Tx^{(i)} - b) \geq 1$$
Remark: the separating line is defined as $w^Tx - b = 0$.

Hinge loss – The hinge loss is used in the setting of SVMs and is defined as $L(z,y) = [1 - yz]_+ = \max(0, 1 - yz)$.

Kernel – Given a feature mapping $\phi$, we define the kernel $K$ as $K(x,z) = \phi(x)^T\phi(z)$. In practice, the kernel defined by $K(x,z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$ is called the Gaussian kernel and is commonly used.
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel, because we actually don't need to know the explicit mapping $\phi$, which is often very complicated. Instead, only the values $K(x,z)$ are needed.

Lagrangian – We define the Lagrangian $\mathcal{L}(w,b)$ as $\mathcal{L}(w,b) = f(w) + \sum_{i=1}^l \beta_i h_i(w)$.
Remark: the coefficients $\beta_i$ are called the Lagrange multipliers.

Generative Learning

A generative model first tries to learn how the data is generated by estimating $P(x|y)$, which we can then use to estimate $P(y|x)$ by using Bayes' rule.

Gaussian Discriminant Analysis

Setting – Gaussian Discriminant Analysis assumes that $y$, $x|y = 0$ and $x|y = 1$ are such that $y \sim \mathrm{Bernoulli}(\phi)$, $x|y = 0 \sim N(\mu_0, \Sigma)$ and $x|y = 1 \sim N(\mu_1, \Sigma)$.

Estimation – Maximizing the likelihood gives the following estimates:
$$\hat\phi = \frac{1}{m}\sum_{i=1}^m 1_{\{y^{(i)}=1\}} \qquad \hat\mu_j = \frac{\sum_{i=1}^m 1_{\{y^{(i)}=j\}}x^{(i)}}{\sum_{i=1}^m 1_{\{y^{(i)}=j\}}}\ (j = 0,1) \qquad \hat\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$

Naive Bayes

Assumption – The Naive Bayes model supposes that the features of each data point are all independent:
$$P(x|y) = P(x_1, x_2, \ldots|y) = P(x_1|y)P(x_2|y)\ldots = \prod_{i=1}^n P(x_i|y)$$

Solutions – Maximizing the log-likelihood gives the following solutions, with $k \in \{0,1\}$ and $l \in [[1,L]]$:
$$P(y = k) = \frac{1}{m}\times\#\{j\,|\,y^{(j)} = k\} \qquad P(x_i = l|y = k) = \frac{\#\{j\,|\,y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j\,|\,y^{(j)} = k\}}$$
Remark: Naive Bayes is widely used for text classification and spam detection.
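
The Naive Bayes solutions above are simple frequency counts; here is a minimal sketch for binary features and binary labels (NumPy assumed; Laplace smoothing is added to avoid zero counts and is not part of the formulas above):

```python
import numpy as np

# Toy binary feature matrix (m examples, n features) and binary labels
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
y = np.array([1, 1, 0, 0, 1])

def fit_naive_bayes(X, y, smoothing=1.0):
    priors, likelihoods = {}, {}
    for k in (0, 1):
        Xk = X[y == k]
        priors[k] = len(Xk) / len(X)                          # P(y = k)
        # P(x_i = 1 | y = k), with Laplace smoothing
        likelihoods[k] = (Xk.sum(axis=0) + smoothing) / (len(Xk) + 2 * smoothing)
    return priors, likelihoods

def predict(x, priors, likelihoods):
    scores = {}
    for k in (0, 1):
        p = likelihoods[k]
        scores[k] = priors[k] * np.prod(np.where(x == 1, p, 1 - p))   # prior * product of P(x_i|y)
    return max(scores, key=scores.get)

priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 1]), priors, likelihoods))   # predicts class 1
```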


Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

Random forest – A tree-based technique that uses a large number of decision trees built out of randomly selected sets of features. Contrary to a simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.
Remark: random forests are a type of ensemble method.

Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are:
• Adaptive boosting: high weights are put on errors to improve at the next boosting step; known as Adaboost.
• Gradient boosting: weak learners are trained on the remaining errors.

Learning Theory

Union bound – Let $A_1, \ldots, A_k$ be $k$ events. We have $P(A_1 \cup \ldots \cup A_k) \leq P(A_1) + \ldots + P(A_k)$.

Hoeffding inequality – Let $Z_1, \ldots, Z_m$ be $m$ iid variables drawn from a Bernoulli distribution of parameter $\phi$. Let $\hat\phi$ be their sample mean and $\gamma > 0$ fixed. We have:
$$P(|\phi - \hat\phi| > \gamma) \leq 2\exp(-2\gamma^2m)$$
Remark: this inequality is also known as the Chernoff bound.

Training error – For a given classifier $h$, we define the training error $\hat\epsilon(h)$, also known as the empirical risk or empirical error, as:
$$\hat\epsilon(h) = \frac{1}{m}\sum_{i=1}^m 1_{\{h(x^{(i)})\neq y^{(i)}\}}$$

Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
• the training and testing sets follow the same distribution
• the training examples are drawn independently

Shattering – Given a set $S = \{x^{(1)}, \ldots, x^{(d)}\}$ and a set of classifiers $\mathcal{H}$, we say that $\mathcal{H}$ shatters $S$ if for any set of labels $\{y^{(1)}, \ldots, y^{(d)}\}$ we have $\exists h \in \mathcal{H},\ \forall i \in [[1,d]],\ h(x^{(i)}) = y^{(i)}$.

Upper bound theorem – Let $\mathcal{H}$ be a finite hypothesis class such that $|\mathcal{H}| = k$, and let $\delta$ and the sample size $m$ be fixed. Then, with probability of at least $1 - \delta$, we have:
$$\epsilon(\hat h) \leq \left(\min_{h\in\mathcal{H}}\epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$$

VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class $\mathcal{H}$, noted $\mathrm{VC}(\mathcal{H})$, is the size of the largest set that is shattered by $\mathcal{H}$.
Remark: the VC dimension of $\mathcal{H} = \{$set of linear classifiers in 2 dimensions$\}$ is 3.

Theorem (Vapnik) – Let $\mathcal{H}$ be given, with $\mathrm{VC}(\mathcal{H}) = d$ and $m$ the number of training examples. With probability at least $1 - \delta$, we have:
$$\epsilon(\hat h) \leq \left(\min_{h\in\mathcal{H}}\epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$$

Other non-parametric approaches

k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its $k$ neighbors from the training set. It can be used in both classification and regression settings.
Remark: the higher the parameter $k$, the higher the bias, and the lower the parameter $k$, the higher the variance.
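
Here is a minimal sketch of the k-NN classifier described just above (NumPy assumed; Euclidean distance and majority voting are the usual default choices):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distances between the query point and every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k closest neighbors
    votes = Counter(y_train[nearest])              # majority vote among neighbor labels
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # predicts class 1
```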


CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Cheatsheet: Machine Learning Tips
Afshine Amidi and Shervine Amidi – September 9, 2018

Metrics

Given a set of data points $\{x^{(1)}, \ldots, x^{(m)}\}$, where each $x^{(i)}$ has $n$ features, associated to a set of outcomes $\{y^{(1)}, \ldots, y^{(m)}\}$, we want to assess a given classifier that learns how to predict $y$ from $x$.

Classification

In the context of binary classification, here are the main metrics that are important to track to assess the performance of the model.

Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. Crossing actual and predicted classes, it contains:
• True Positives (TP): actual +, predicted +
• False Negatives (FN): actual +, predicted – (Type II error)
• False Positives (FP): actual –, predicted + (Type I error)
• True Negatives (TN): actual –, predicted –

Main metrics – The following metrics are commonly used to assess the performance of classification models:
• Accuracy = (TP + TN)/(TP + TN + FP + FN): overall performance of the model
• Precision = TP/(TP + FP): how accurate the positive predictions are
• Recall (sensitivity) = TP/(TP + FN): coverage of actual positive samples
• Specificity = TN/(TN + FP): coverage of actual negative samples
• F1 score = 2TP/(2TP + FP + FN): hybrid metric useful for unbalanced classes

ROC – The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold, where:
• True Positive Rate TPR = TP/(TP + FN) (equivalent to recall, sensitivity)
• False Positive Rate FPR = FP/(TN + FP) (equivalent to 1 − specificity)

AUC – The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve.

Regression

Basic metrics – Given a regression model $f$, the following metrics are commonly used to assess the performance of the model:
• Total sum of squares: $SS_{\mathrm{tot}} = \sum_{i=1}^m (y_i - \overline{y})^2$
• Explained sum of squares: $SS_{\mathrm{reg}} = \sum_{i=1}^m (f(x_i) - \overline{y})^2$
• Residual sum of squares: $SS_{\mathrm{res}} = \sum_{i=1}^m (y_i - f(x_i))^2$

Coefficient of determination – The coefficient of determination, often noted $R^2$ or $r^2$, provides a measure of how well the observed outcomes are replicated by the model and is defined as $R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}$.

Main metrics – The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables $n$ that they consider:
• Mallow's Cp: $\frac{SS_{\mathrm{res}} + 2(n+1)\hat\sigma^2}{m}$
• AIC: $2\left[(n+2) - \log(L)\right]$
• BIC: $\log(m)(n+2) - 2\log(L)$
• Adjusted $R^2$: $1 - \frac{(1-R^2)(m-1)}{m-n-1}$
where $L$ is the likelihood and $\hat\sigma^2$ is an estimate of the variance associated with each response.
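
As a small sketch of the classification metrics above, here is how they can be computed directly from predicted and actual labels (NumPy assumed; the labels are illustrative):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # Type I errors
fn = np.sum((y_true == 1) & (y_pred == 0))   # Type II errors

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # sensitivity, true positive rate
specificity = tn / (tn + fp)
f1 = 2 * tp / (2 * tp + fp + fn)
print(accuracy, precision, recall, specificity, f1)
```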


Model selection

Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have:
• Training set: the model is trained on it; usually 80% of the dataset.
• Validation set: the model is assessed on it; usually 20% of the dataset; also called the hold-out or development set.
• Testing set: the model gives predictions on it; unseen data.

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

Model selection – Train models on the training set, evaluate them on the development set, pick the model with the best performance on the development set, and retrain that model on the whole training set.

Cross-validation – Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. The different types are:
• k-fold: training on $k-1$ folds and assessment on the remaining one; generally $k = 5$ or $10$.
• Leave-p-out: training on $n-p$ observations and assessment on the $p$ remaining ones; the case $p = 1$ is called leave-one-out.

The most commonly used method is k-fold cross-validation, which splits the training data into $k$ folds to validate the model on one fold while training the model on the $k-1$ other folds, all of this $k$ times. The error is then averaged over the $k$ folds and is named the cross-validation error.

Regularization – The regularization procedure aims at preventing the model from overfitting the data and thus deals with high-variance issues. The commonly used techniques each add a penalty to the cost function:
• LASSO: penalty $\lambda\|\theta\|_1$, $\lambda \in \mathbb{R}$; shrinks coefficients to 0; good for variable selection.
• Ridge: penalty $\lambda\|\theta\|_2^2$, $\lambda \in \mathbb{R}$; makes coefficients smaller.
• Elastic Net: penalty $\lambda\left[(1-\alpha)\|\theta\|_1 + \alpha\|\theta\|_2^2\right]$, $\lambda \in \mathbb{R}$, $\alpha \in [0,1]$; tradeoff between variable selection and small coefficients.

Diagnostics

Bias – The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.

Variance – The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff – The simpler the model, the higher the bias; the more complex the model, the higher the variance.
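
Here is a minimal sketch of the k-fold cross-validation procedure described above, written with plain NumPy index splitting rather than a library helper (the fit/predict functions are placeholders for any model; least squares is used as the example):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, k=5, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)                 # k roughly equal folds
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])        # train on k-1 folds
        y_hat = predict(model, X[val_idx])             # validate on the held-out fold
        errors.append(np.mean((y_hat - y[val_idx]) ** 2))
    return np.mean(errors)                             # cross-validation error

# Example with least-squares fit/predict placeholders
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda theta, X: X @ theta
X = np.c_[np.ones(50), np.linspace(0, 1, 50)]
y = 3 * X[:, 1] + 0.1 * np.random.default_rng(1).normal(size=50)
print(kfold_cv_error(X, y, fit, predict, k=5))
```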


Symptoms:
• Underfitting: high training error; training error close to test error; high bias.
• Just right: training error slightly lower than test error.
• Overfitting: low training error; training error much lower than test error; high variance.

Remedies:
• Underfitting: complexify the model, add more features, train longer.
• Overfitting: regularize, get more data.

Error analysis – Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis – Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.


Cheat Sheet – Regression Analysis

What is Regression Analysis?
Fitting a function f(.) to datapoints yi = f(xi) under some error function. Based on the estimated function and error, we have the following types of regression:
1. Linear Regression: fits a line minimizing the sum of mean-squared error for each datapoint.
2. Polynomial Regression: fits a polynomial of order k (k+1 unknowns) minimizing the sum of mean-squared error for each datapoint.
3. Bayesian Regression: for each datapoint, fits a Gaussian distribution by minimizing the mean-squared error. As the number of data points xi increases, it converges to point estimates.
4. Ridge Regression: can fit either a line or a polynomial, minimizing the sum of mean-squared error for each datapoint plus the weighted L2 norm of the function parameters beta.
5. LASSO Regression: can fit either a line or a polynomial, minimizing the sum of mean-squared error for each datapoint plus the weighted L1 norm of the function parameters beta.
6. Logistic Regression: can fit either a line or a polynomial with sigmoid activation, minimizing the binary cross-entropy loss for each datapoint. The labels y are binary class labels.

[Visual representation omitted: example plots of linear regression, polynomial regression, Bayesian linear regression and logistic regression (labels 0 and 1).]

Summary:
• Linear: fits a line in n dimensions; error is the sum of squared residuals.
• Polynomial: fits a polynomial of order k; error is the sum of squared residuals.
• Bayesian Linear: fits a Gaussian distribution for each point; error is based on the mean-squared error.
• Ridge: fits a line or polynomial; error adds the weighted L2 norm of the parameters.
• LASSO: fits a line or polynomial; error adds the weighted L1 norm of the parameters.
• Logistic: fits a line or polynomial with sigmoid; error is the binary cross-entropy.

Source: https://www.cheatsheets.aqeel-anwar.com
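
As a small illustration of the fits above, here is a sketch comparing a plain least-squares line, a polynomial fit and a ridge-regularized fit (NumPy assumed; the data and the ridge penalty λ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = 1.5 * x**2 - 0.5 * x + 0.1 * rng.normal(size=x.size)

# Linear regression: fit a line by least squares
line_coeffs = np.polyfit(x, y, deg=1)

# Polynomial regression: fit a polynomial of order k = 2
poly_coeffs = np.polyfit(x, y, deg=2)

# Ridge regression on polynomial features: minimize ||y - X b||^2 + lam * ||b||^2
X = np.vander(x, 3)                              # columns: x^2, x, 1
lam = 0.1
ridge_coeffs = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(line_coeffs, poly_coeffs, ridge_coeffs)
```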


Cheat Sheet – Regularization in ML

What is Regularization in ML?
• Regularization is an approach to address over-fitting in ML.
• An overfitted model fails to generalize its estimations on test data.
• When the underlying model to be learned is low bias/high variance, or when we have a small amount of data, the estimated model is prone to over-fitting.
• Regularization reduces the variance of the model.

Types of Regularization:
1. Modify the loss function:
• L2 Regularization: prevents the weights from getting too large (as measured by the L2 norm). The larger the weights, the more complex the model and the greater the chance of overfitting.
• L1 Regularization: prevents the weights from getting too large (as measured by the L1 norm). L1 regularization introduces sparsity in the weights: it forces more weights to be exactly zero, rather than reducing the average magnitude of all weights.
• Entropy regularization: used for models that output a probability. Forces the probability distribution towards the uniform distribution.

2. Modify data sampling:
• Data augmentation: create more data from the available data by randomly cropping, dilating, rotating, adding a small amount of noise, etc.
• K-fold cross-validation: divide the data into k groups; train on (k-1) groups and test on 1 group; try all k possible combinations.

3. Change the training approach:
• Injecting noise: add random noise to the weights while they are being learned. This pushes the model to be relatively insensitive to small variations in the weights, hence regularization.
• Dropout: generally used for neural networks. Connections between consecutive layers are randomly dropped based on a dropout ratio and the remaining network is trained in the current iteration. In the next iteration, another set of random connections is dropped.

[Figures omitted: an overfitting illustration, a 5-fold cross-validation split diagram, and a dropout diagram (16 connections, 11 active at a 30% dropout ratio).]

Source: https://www.cheatsheets.aqeel-anwar.com
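
As a minimal sketch of the dropout idea above, here is how a random dropout mask can be applied to a layer's activations during training (NumPy assumed; the inverted-dropout rescaling by 1/(1-p) is the common convention):

```python
import numpy as np

def dropout(activations, drop_ratio=0.3, rng=None):
    # Randomly zero out a fraction of the units and rescale the survivors
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_ratio
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob   # inverted dropout: expected value is unchanged

a = np.ones((2, 8))                          # toy activations from a layer
print(dropout(a, drop_ratio=0.3))            # roughly 30% of entries are zeroed on each call
```
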
Cheat Sheet – Bias-Variance Tradeoff

What is Bias?
• Error between the average model prediction and the ground truth.
• The bias of the estimated function tells us the capacity of the underlying model to predict the values.

What is Variance?
• Average variability in the model prediction for the given dataset.
• The variance of the estimated function tells you how much the function can adjust to a change in the dataset.

High bias: overly-simplified model; under-fitting; high error on both test and train data.

High variance: overly-complex model; over-fitting; low error on train data and high error on test data; starts modelling the noise in the input.

Bias-variance trade-off:
• Increasing bias (not always) reduces variance, and vice-versa.
• Error = bias² + variance + irreducible error.
• The best model is the one where the total error is minimized, as a compromise between bias and variance.

Source: https://www.cheatsheets.aqeel-anwar.com
Cheat Sheet – Bayes Theorem and Classifier

What is Bayes' Theorem?
• Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
• It tells us how the probability of an event changes when we have knowledge of another event:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
where P(A|B) is the posterior probability (usually a better estimate than the prior P(A)), P(B|A) is the likelihood, P(A) is the prior probability and P(B) is the evidence.

Example
• Probability of fire P(F) = 1%
• Probability of smoke P(S) = 10%
• Probability of smoke given there is a fire P(S|F) = 90%
• What is the probability that there is a fire given we see smoke, P(F|S)?
P(F|S) = P(S|F)P(F)/P(S) = 0.9 × 0.01 / 0.1 = 9%.

Maximum A Posteriori (MAP) Estimation
The MAP estimate of the random variable y, given that we have observed iid samples (x1, x2, x3, …), is the value of y that maximizes the product of prior and likelihood. We try to accommodate our prior knowledge when estimating.

Maximum Likelihood Estimation (MLE)
The MLE estimate of the random variable y, given that we have observed iid samples (x1, x2, x3, …), is the value of y that maximizes only the likelihood. We assume we don't have any prior knowledge of the quantity being estimated.
MLE is a special case of MAP where our prior is uniform (all values are equally likely).

Naïve Bayes' Classifier (instantiation of MAP as a classifier)
Suppose we have two classes, y = y1 and y = y2, and more than one evidence/feature (x1, x2, x3, …). Using Bayes' theorem, we pick the class with the largest posterior probability. Naïve Bayes assumes the features (x1, x2, …) are conditionally independent given the class, i.e. P(x1, x2, …|y) = P(x1|y)P(x2|y)…

Source: https://www.cheatsheets.aqeel-anwar.com
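
The worked fire/smoke example above can be checked in a couple of lines; the sketch below also shows the two-class MAP decision (plain Python, with P(S|no fire) derived from the law of total probability rather than given in the example):

```python
p_fire = 0.01                 # prior P(F)
p_smoke = 0.10                # evidence P(S)
p_smoke_given_fire = 0.90     # likelihood P(S|F)

# Bayes' rule: posterior P(F|S) = P(S|F) P(F) / P(S)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)     # 0.09, i.e. 9%

# Two-class MAP decision: pick the class with the largest prior * likelihood
def map_class(priors, likelihoods):
    scores = {c: priors[c] * likelihoods[c] for c in priors}
    return max(scores, key=scores.get)

p_no_fire = 1 - p_fire
p_smoke_given_no_fire = (p_smoke - p_smoke_given_fire * p_fire) / p_no_fire  # law of total probability
print(map_class({"fire": p_fire, "no_fire": p_no_fire},
                {"fire": p_smoke_given_fire, "no_fire": p_smoke_given_no_fire}))  # "no_fire"
```
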
Cheat Sheet – Imbalanced Data in Classification

Accuracy = correct predictions / total predictions. For an imbalanced dataset (e.g. 90% of points labelled blue, 10% green), a classifier that always predicts the blue label yields a prediction accuracy of 90%, so accuracy doesn't always give the correct insight about your trained model.

• Accuracy: percentage of correct predictions (correct predictions over total predictions); one value for the entire model.
• Precision: exactness of the model (from the detected cats, how many were actually cats); each class/label has a value.
• Recall: completeness of the model (correctly detected cats over total cats); each class/label has a value.
• F1 Score: combines precision and recall (harmonic mean of precision and recall); each class/label has a value.

Performance metrics associated with class 1, derived from the confusion matrix of true/false positives and negatives:
• Precision = TP/(TP + FP)
• Recall (sensitivity, true positive rate) = TP/(TP + FN)
• Specificity = TN/(TN + FP)
• False positive rate = FP/(TN + FP)
• F1 score = 2 × (Precision × Recall)/(Precision + Recall)
• Accuracy = (TP + TN)/(TP + FN + FP + TN)

Possible solutions
1. Data Replication: replicate the available minority-class data until the number of samples is comparable.
2. Synthetic Data: for images, rotate, dilate, crop, or add noise to existing input images to create new data.
3. Modified Loss: modify the loss to reflect a greater error when misclassifying the smaller sample set, e.g. loss = a·loss_green + b·loss_blue with a > b.
4. Change the algorithm: increase the model/algorithm complexity so that the two classes are perfectly separable (con: overfitting). For example, no straight line through the origin (y = ax) can perfectly separate the data (the best solution, the line y = 0, predicts all labels blue), while a line with an intercept (y = ax + b) can, so the green class is no longer predicted as blue.

Source: https://www.cheatsheets.aqeel-anwar.com
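
Here is a minimal sketch of the "modified loss" idea above: a class-weighted cross-entropy where the minority class gets a larger weight (NumPy assumed; the weights and labels are illustrative):

```python
import numpy as np

def weighted_cross_entropy(y_true, y_prob, weight_minority=9.0, weight_majority=1.0):
    # Larger weight on the minority class (label 1) so its errors cost more
    eps = 1e-12
    weights = np.where(y_true == 1, weight_minority, weight_majority)
    losses = -(y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps))
    return np.mean(weights * losses)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # imbalanced labels: 90% class 0
y_prob = np.full(10, 0.1)                            # model that mostly predicts class 0
print(weighted_cross_entropy(y_true, y_prob))        # the single minority error dominates the loss
```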


Cheat Sheet – Ensemble Learning in ML

What is Ensemble Learning? The wisdom of the crowd: combine multiple weak models/learners into one predictive model to reduce bias, variance and/or improve accuracy.

Types of Ensemble Learning (N weak learners):

1. Bagging: trains N different weak models (usually of the same type – homogenous) with N non-overlapping subsets of the input dataset, in parallel. In the test phase, each model is evaluated, and the label with the greatest number of predictions is selected as the prediction. Bagging methods reduce the variance of the prediction.

2. Boosting: trains N different weak models (usually of the same type – homogenous) with the complete dataset, in sequential order. The datapoints wrongly classified by the previous weak model are given more weight so that they can be classified properly by the next weak learner. In the test phase, each model is evaluated and, based on the test error of each weak model, its prediction is weighted for voting. Boosting methods decrease the bias of the prediction.

3. Stacking: trains N different weak models (usually of different types – heterogenous) with one of two subsets of the dataset, in parallel. Once the weak learners are trained, they are used to train a meta-learner that combines their predictions and carries out the final prediction using the other subset. In the test phase, each model predicts its label, and this set of labels is fed to the meta-learner, which generates the final prediction.

Boosting, step by step:
• Step 1: assign equal weights to all the datapoints in the dataset.
• Step 2: train a weak model with the current weights on all the datapoints; based on the final error of the trained weak model, calculate a scalar alpha; use alpha to increase the weights of wrongly classified points and decrease the weights of correctly classified points.
• Step 3 onwards: repeat the previous step for each subsequent weak model with the adjusted weights.
• Final step: in the test phase, predict from each weak model and vote their predictions, weighted by the corresponding alpha, to get the final prediction.

Bagging, step by step:
• Step 1: create N subsets from the original dataset, one for each weak model.
• Step 2: train each weak model with an independent subset, in parallel.
• Step 3: in the test phase, predict from each weak model and vote their predictions to get the final prediction.

Stacking, step by step:
• Step 1: create 2 subsets from the original dataset, one for training the weak models and one for the meta-learner.
• Step 2: train each weak model with the weak-learner subset.
• Step 3: train a meta-learner whose input is the outputs of the weak models on the meta-learner subset.
• Step 4: in the test phase, feed the input to the weak models, collect their outputs and feed them to the meta-model; the output of the meta-model is the final prediction.

Comparison:
• Focuses on: Bagging – reducing variance; Boosting – reducing bias; Stacking – improving accuracy.
• Nature of weak learners: Bagging – homogenous; Boosting – homogenous; Stacking – heterogenous.
• Weak learners are aggregated by: Bagging – simple voting; Boosting – weighted voting; Stacking – learned voting (meta-learner).

Source: https://www.cheatsheets.aqeel-anwar.com
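
As a small sketch of the bagging procedure above, here is a bare-bones version that trains one decision stump per random subset of the data and combines them by simple (majority) voting (NumPy assumed; the weak learner is deliberately primitive and the data is synthetic):

```python
import numpy as np

def fit_stump(X, y):
    # Weak learner: pick the (feature, threshold, polarity) with the lowest training error
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - t) > 0, 1, 0)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best[1:]

def predict_stump(stump, X):
    j, t, polarity = stump
    return np.where(polarity * (X[:, j] - t) > 0, 1, 0)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Bagging: train N stumps on random subsets, then combine them by simple voting
n_learners = 5
stumps = []
for _ in range(n_learners):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)   # a subset of the data
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.array([predict_stump(s, X) for s in stumps])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)          # label with the most votes
print(np.mean(ensemble_pred == y))                               # training accuracy of the ensemble
```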


3.2. Unsupervised learning
CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Cheatsheet: Unsupervised Learning
Afshine Amidi and Shervine Amidi – September 9, 2018

Introduction to Unsupervised Learning

Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data $\{x^{(1)}, \ldots, x^{(m)}\}$.

Jensen's inequality – Let $f$ be a convex function and $X$ a random variable. We have $E[f(X)] \geq f(E[X])$.

Expectation-Maximization

Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted $z$. The most common settings with latent variables:
• Mixture of $k$ Gaussians: $z \sim \mathrm{Multinomial}(\phi)$, $x|z \sim N(\mu_j, \Sigma_j)$, with $\mu_j \in \mathbb{R}^n$, $\phi \in \mathbb{R}^k$.
• Factor analysis: $z \sim N(0, I)$, $x|z \sim N(\mu + \Lambda z, \psi)$, with $\mu_j \in \mathbb{R}^n$.

Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter $\theta$ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step):
• E-step: evaluate the posterior probability $Q_i(z^{(i)})$ that each data point $x^{(i)}$ came from a particular cluster $z^{(i)}$: $Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)};\theta)$
• M-step: use the posterior probabilities $Q_i(z^{(i)})$ as cluster-specific weights on data points $x^{(i)}$ to separately re-estimate each cluster model:
$$\theta_i = \underset{\theta}{\mathrm{argmax}}\ \sum_i\int_{z^{(i)}} Q_i(z^{(i)})\log\left(\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right)dz^{(i)}$$

k-means clustering

We note $c^{(i)}$ the cluster of data point $i$ and $\mu_j$ the center of cluster $j$.

Algorithm – After randomly initializing the cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$, the k-means algorithm repeats the following step until convergence:
$$c^{(i)} = \underset{j}{\mathrm{argmin}}\,\|x^{(i)} - \mu_j\|^2 \qquad\text{and}\qquad \mu_j = \frac{\sum_{i=1}^m 1_{\{c^{(i)}=j\}}x^{(i)}}{\sum_{i=1}^m 1_{\{c^{(i)}=j\}}}$$

Distortion function – In order to see if the algorithm converges, we look at the distortion function defined as:
$$J(c,\mu) = \sum_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2$$
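
Here is a minimal sketch of the k-means iteration above (NumPy assumed; centroids are initialized by picking random data points, which is one common choice):

```python
import numpy as np

def kmeans(X, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assignment step: c_i = argmin_j ||x_i - mu_j||^2
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    distortion = np.sum((X - centroids[labels]) ** 2)           # J(c, mu)
    return labels, centroids, distortion

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)), rng.normal(3, 0.3, size=(30, 2))])
labels, centroids, distortion = kmeans(X, k=2)
print(centroids, distortion)
```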


Hierarchical clustering

Algorithm – A clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions:
• Ward linkage: minimize within-cluster distance.
• Average linkage: minimize average distance between cluster pairs.
• Complete linkage: minimize maximum distance between cluster pairs.

Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground-truth labels, as was the case in the supervised learning setting.

Silhouette coefficient – By noting $a$ the mean distance between a sample and all other points in the same class, and $b$ the mean distance between a sample and all other points in the next nearest cluster, the silhouette coefficient $s$ for a single sample is defined as $s = \frac{b-a}{\max(a,b)}$.

Calinski-Harabaz index – By noting $k$ the number of clusters, and $B_k$ and $W_k$ the between- and within-clustering dispersion matrices respectively, defined as
$$B_k = \sum_{j=1}^k n_{c^{(i)}}(\mu_{c^{(i)}} - \mu)(\mu_{c^{(i)}} - \mu)^T \qquad W_k = \sum_{i=1}^m (x^{(i)} - \mu_{c^{(i)}})(x^{(i)} - \mu_{c^{(i)}})^T$$
the Calinski-Harabaz index $s(k)$ indicates how well a clustering model defines its clusters: the higher the score, the more dense and well separated the clusters are. It is defined as
$$s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)}\times\frac{N-k}{k-1}$$

Principal component analysis

PCA is a dimension-reduction technique that finds the variance-maximizing directions onto which to project the data.

Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n\times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z \in \mathbb{R}^n\setminus\{0\}$, called an eigenvector, such that $Az = \lambda z$.

Spectral theorem – Let $A \in \mathbb{R}^{n\times n}$. If $A$ is symmetric, then $A$ is diagonalizable by a real orthogonal matrix $U \in \mathbb{R}^{n\times n}$: with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, we have $A = U\Lambda U^T$.
Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix $A$.

Algorithm – The Principal Component Analysis (PCA) procedure is a dimension-reduction technique that projects the data on $k$ dimensions by maximizing the variance of the data as follows:
• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j} \quad\text{where}\quad \mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)} \quad\text{and}\quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m (x_j^{(i)} - \mu_j)^2$$
• Step 2: Compute $\Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)}x^{(i)T} \in \mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.
• Step 3: Compute $u_1, \ldots, u_k \in \mathbb{R}^n$, the $k$ orthogonal principal eigenvectors of $\Sigma$, i.e. the orthogonal eigenvectors of the $k$ largest eigenvalues.
• Step 4: Project the data on $\mathrm{span}_{\mathbb{R}}(u_1, \ldots, u_k)$. This procedure maximizes the variance among all $k$-dimensional spaces.
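
The PCA steps above can be sketched directly with an eigendecomposition of the empirical covariance matrix (NumPy assumed; k = 1 and the correlated toy data are illustrative):

```python
import numpy as np

def pca(X, k=1):
    # Step 1: normalize each feature to zero mean and unit standard deviation
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: empirical covariance matrix (symmetric, real eigenvalues)
    sigma = Xn.T @ Xn / len(Xn)
    # Step 3: eigenvectors of the k largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(sigma)        # eigh returns eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data onto the span of the principal eigenvectors
    return Xn @ top

rng = np.random.default_rng(6)
t = rng.normal(size=200)
X = np.c_[t, 2 * t + 0.1 * rng.normal(size=200)]    # strongly correlated 2-D data
print(pca(X, k=1)[:5])                               # 1-D projection capturing most of the variance
```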


Independent component analysis

ICA is a technique meant to find the underlying generating sources.

Assumptions – We assume that our data $x$ has been generated by the $n$-dimensional source vector $s = (s_1, \ldots, s_n)$, where the $s_i$ are independent random variables, via a mixing and non-singular matrix $A$ as follows: $x = As$. The goal is to find the unmixing matrix $W = A^{-1}$ by an update rule.

Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix $W$ by following the steps below:
• Write the probability of $x = As = W^{-1}s$ as $p(x) = \prod_{i=1}^n p_s(w_i^Tx)\cdot|W|$
• Write the log-likelihood given our training data $\{x^{(i)}, i \in [[1,m]]\}$ and, by noting $g$ the sigmoid function:
$$l(W) = \sum_{i=1}^m\left(\sum_{j=1}^n \log\left(g'(w_j^Tx^{(i)})\right) + \log|W|\right)$$
• Therefore, the stochastic gradient ascent learning rule is such that for each training example $x^{(i)}$, we update $W$ as follows:
$$W \leftarrow W + \alpha\left(\begin{pmatrix}1 - 2g(w_1^Tx^{(i)})\\ 1 - 2g(w_2^Tx^{(i)})\\ \vdots\\ 1 - 2g(w_n^Tx^{(i)})\end{pmatrix}x^{(i)T} + (W^T)^{-1}\right)$$


Cheat Sheet – PCA Dimensionality Reduction
What is PCA?
• Based on the dataset find a new set of orthogonal feature vectors in such a way that the
data spread is maximum in the direction of the feature vector (or dimension)
• Rates the feature vector in the decreasing order of data spread (or variance)
• The datapoints have maximum variance in the first feature vector, and minimum variance
in the last feature vector
• The variance of the datapoints in the direction of feature vector can be termed as a
measure of information in that direction.
Steps
1. Standardize the datapoints
2. Find the covariance matrix from the given datapoints
3. Carry out eigen-value decomposition of the covariance matrix
4. Sort the eigenvalues and eigenvectors

Dimensionality Reduction with PCA


• Keep the first m out of n feature vectors rated by PCA. These m vectors will be the best m
vectors preserving the maximum information that could have been preserved with m
vectors on the given dataset
Steps:
1. Carry out steps 1-4 from above
2. Keep first m feature vectors from the sorted eigenvector matrix
3. Transform the data for the new basis (feature vectors)
4. The importance of the feature vector is proportional to the magnitude of the eigen value

[Figures omitted.
Figure 1: Datapoints with feature vectors as x and y-axis.
Figure 2: The cartesian coordinate system is rotated to maximize the standard deviation along any one axis (new feature #2).
Figure 3: Remove the feature vector with minimum standard deviation of datapoints (new feature #1) and project the data on new feature #2.]
Source: https://www.cheatsheets.aqeel-anwar.com
4. Deep learning
CS 229 – Machine Learning (https://stanford.edu/~shervine)

VIP Cheatsheet: Deep Learning
Afshine Amidi and Shervine Amidi – September 15, 2018

Neural Networks

Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.

Architecture – By noting $i$ the $i$th layer of the network and $j$ the $j$th hidden unit of the layer, we have:
$$z_j^{[i]} = w_j^{[i]T}x + b_j^{[i]}$$
where we note $w$, $b$, $z$ the weight, bias and output respectively.

Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:
• Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$
• Tanh: $g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
• ReLU: $g(z) = \max(0, z)$
• Leaky ReLU: $g(z) = \max(\epsilon z, z)$ with $\epsilon \ll 1$

Cross-entropy loss – In the context of neural networks, the cross-entropy loss $L(z,y)$ is commonly used and is defined as:
$$L(z,y) = -\left[y\log(z) + (1-y)\log(1-z)\right]$$

Learning rate – The learning rate, often noted $\eta$, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight $w$ is computed using the chain rule and is of the following form:
$$\frac{\partial L(z,y)}{\partial w} = \frac{\partial L(z,y)}{\partial a}\times\frac{\partial a}{\partial z}\times\frac{\partial z}{\partial w}$$
As a result, the weight is updated as $w \leftarrow w - \eta\frac{\partial L(z,y)}{\partial w}$.

Updating weights – In a neural network, weights are updated as follows:
• Step 1: Take a batch of training data.
• Step 2: Perform forward propagation to obtain the corresponding loss.
• Step 3: Backpropagate the loss to get the gradients.
• Step 4: Use the gradients to update the weights of the network.

Dropout – Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability $p$ or kept with probability $1-p$.

Convolutional Neural Networks

Convolutional layer requirement – By noting $W$ the input volume size, $F$ the size of the convolutional layer neurons, $P$ the amount of zero padding and $S$ the stride, the number of neurons $N$ that fit in a given volume is such that:
$$N = \frac{W - F + 2P}{S} + 1$$

Batch normalization – A step of hyperparameters $\gamma, \beta$ that normalizes the batch $\{x_i\}$. By noting $\mu_B$, $\sigma_B^2$ the mean and variance of the batch we want to correct, it is done as follows:
$$x_i \leftarrow \gamma\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.
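
Here is a minimal sketch of the four weight-update steps above for a single sigmoid neuron with cross-entropy loss, where the chain rule conveniently collapses to (a − y)·x (NumPy assumed; the data and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X = rng.normal(size=(64, 3))                     # a batch of training data (Step 1)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w, b, eta = np.zeros(3), 0.0, 0.5
for _ in range(200):
    z = X @ w + b                                # forward propagation (Step 2)
    a = sigmoid(z)
    # Backpropagation (Step 3): dL/dz = a - y for sigmoid + cross-entropy, then chain rule to w and b
    dz = (a - y) / len(X)
    dw, db = X.T @ dz, dz.sum()
    w, b = w - eta * dw, b - eta * db            # gradient step on the weights (Step 4)

print(w, b)
```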


Recurrent Neural Networks

Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:
• Input gate: write to cell or not?
• Forget gate: erase a cell or not?
• Output gate: reveal a cell or not?
• Gate: how much writing?

LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.

Reinforcement Learning and Control

The goal of reinforcement learning is for an agent to learn how to evolve in an environment.

Markov decision processes – A Markov decision process (MDP) is a 5-tuple $(S, A, \{P_{sa}\}, \gamma, R)$ where:
• $S$ is the set of states
• $A$ is the set of actions
• $\{P_{sa}\}$ are the state transition probabilities for $s \in S$ and $a \in A$
• $\gamma \in [0,1[$ is the discount factor
• $R: S\times A \to \mathbb{R}$ or $R: S \to \mathbb{R}$ is the reward function that the algorithm wants to maximize

Policy – A policy $\pi$ is a function $\pi: S \to A$ that maps states to actions.
Remark: we say that we execute a given policy $\pi$ if, given a state $s$, we take the action $a = \pi(s)$.

Value function – For a given policy $\pi$ and a given state $s$, we define the value function $V^\pi$ as:
$$V^\pi(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2R(s_2) + \ldots\,|\,s_0 = s, \pi\right]$$

Bellman equation – The optimal Bellman equations characterize the value function $V^{\pi^*}$ of the optimal policy $\pi^*$:
$$V^{\pi^*}(s) = R(s) + \max_{a\in A}\gamma\sum_{s'\in S}P_{sa}(s')V^{\pi^*}(s')$$
Remark: the optimal policy $\pi^*$ for a given state $s$ is such that $\pi^*(s) = \underset{a\in A}{\mathrm{argmax}}\sum_{s'\in S}P_{sa}(s')V^*(s')$.

Value iteration algorithm – The value iteration algorithm is in two steps:
• We initialize the value: $V_0(s) = 0$
• We iterate the value based on the values before:
$$V_{i+1}(s) = R(s) + \max_{a\in A}\left[\sum_{s'\in S}\gamma P_{sa}(s')V_i(s')\right]$$

Maximum likelihood estimate – The maximum likelihood estimates for the state transition probabilities are:
$$P_{sa}(s') = \frac{\#\text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times took action } a \text{ in state } s}$$

Q-learning – Q-learning is a model-free estimation of $Q$, which is done as follows:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R(s,a,s') + \gamma\max_{a'}Q(s',a') - Q(s,a)\right]$$
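
Here is a minimal sketch of the Q-learning update above on a tiny deterministic chain environment (NumPy assumed; the environment, rewards and hyperparameters are illustrative):

```python
import numpy as np

n_states, n_actions = 5, 2            # chain of 5 states; actions: 0 = left, 1 = right
gamma, alpha, eps = 0.9, 0.5, 0.2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(8)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0   # reward only in the rightmost state
    return s_next, reward

for _ in range(500):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()  # epsilon-greedy
        s_next, r = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned greedy policy: move right (1) in every state
```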


CS 230 – Deep Learning
Super VIP Cheatsheet: Deep Learning
Afshine Amidi and Shervine Amidi
November 25, 2018

Contents

1 Convolutional Neural Networks
  1.1 Overview
  1.2 Types of layer
  1.3 Filter hyperparameters
  1.4 Tuning hyperparameters
  1.5 Commonly used activation functions
  1.6 Object detection
    1.6.1 Face verification and recognition
    1.6.2 Neural style transfer
    1.6.3 Architectures using computational tricks
2 Recurrent Neural Networks
  2.1 Overview
  2.2 Handling long term dependencies
  2.3 Learning word representation
    2.3.1 Motivation and notations
    2.3.2 Word embeddings
  2.4 Comparing words
  2.5 Language model
  2.6 Machine translation
  2.7 Attention
3 Deep Learning Tips and Tricks
  3.1 Data processing
  3.2 Training a neural network
    3.2.1 Definitions
    3.2.2 Finding optimal weights
  3.3 Parameter tuning
    3.3.1 Weights initialization
    3.3.2 Optimizing convergence
  3.4 Regularization
  3.5 Good practices

1 Convolutional Neural Networks

1.1 Overview

r Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers: convolution layers, pooling layers and fully connected layers. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.

1.2 Types of layer

r Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.

Remark: the convolution step can be generalized to the 1D and 3D cases as well.

r Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.

  Max pooling: each pooling operation selects the maximum value of the current view. It preserves detected features and is the most commonly used.
  Average pooling: each pooling operation averages the values of the current view. It downsamples the feature map and is used in LeNet.

r Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.

1.3 Filter hyperparameters

The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.

r Dimensions of a filter – A filter of size F × F applied to an input containing C channels is a F × F × C volume that performs convolutions on an input of size I × I × C and produces an output feature map (also called activation map) of size O × O × 1.

Remark: the application of K filters of size F × F results in an output feature map of size O × O × K.

r Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.

r Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:

  Valid: P = 0. No padding; drops the last convolution if dimensions do not match.
  Same: padding such that the feature map has size ⌈I/S⌉, with Pstart = ⌊(S⌈I/S⌉ − I + F − S)/2⌋ and Pend = ⌈(S⌈I/S⌉ − I + F − S)/2⌉. The output size is mathematically convenient; also called 'half' padding.
  Full: maximum padding such that end convolutions are applied on the limits of the input, with Pstart ∈ [[0, F − 1]] and Pend = F − 1. The filter 'sees' the input end-to-end.

1.4 Tuning hyperparameters

r Parameter compatibility in convolution layer – By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:

  O = (I − F + Pstart + Pend)/S + 1

Remark: often times, Pstart = Pend ≜ P, in which case we can replace Pstart + Pend by 2P in the formula above.
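As a quick illustration of the output-size formula (the input size, filter size, stride and padding values below are arbitrary examples, not taken from the cheat sheet):

    def conv_output_size(i, f, s, p_start=0, p_end=0):
        """Output length along one dimension: O = (I - F + P_start + P_end) / S + 1."""
        return (i - f + p_start + p_end) // s + 1

    # e.g. a 32-pixel input, 5x5 filter, stride 1, padding 2 on each side -> 32
    print(conv_output_size(i=32, f=5, s=1, p_start=2, p_end=2))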

r Understanding the complexity of the model – In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:

  CONV layer: input size I × I × C, output size O × O × K, number of parameters (F × F × C + 1) · K.
  Remarks: one bias parameter per filter; in most cases S < F; a common choice for K is 2C.

  POOL layer: input size I × I × C, output size O × O × C, 0 parameters.
  Remarks: the pooling operation is done channel-wise; in most cases S = F.

  FC layer: input size N_in, output size N_out, number of parameters (N_in + 1) × N_out.
  Remarks: the input is flattened; one bias parameter per neuron; the number of FC neurons is free of structural constraints.

r Receptive field – The receptive field at layer k is the area denoted R_k × R_k of the input that each pixel of the k-th activation map can 'see'. By calling F_j the filter size of layer j and S_i the stride value of layer i and with the convention S_0 = 1, the receptive field at layer k can be computed with the formula:

  R_k = 1 + Σ_{j=1..k} (F_j − 1) Π_{i=0..j−1} S_i

In the example below, we have F_1 = F_2 = 3 and S_1 = S_2 = 1, which gives R_2 = 1 + 2·1 + 2·1 = 5.

1.5 Commonly used activation functions

r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:

  ReLU:        g(z) = max(0, z). Non-linearity complexities biologically interpretable.
  Leaky ReLU:  g(z) = max(εz, z) with ε << 1. Addresses the dying ReLU issue for negative values.
  ELU:         g(z) = max(α(e^z − 1), z) with α << 1. Differentiable everywhere.

r Softmax – The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x ∈ R^n and outputs a vector of output probabilities p ∈ R^n through a softmax function at the end of the architecture. It is defined as follows:

  p = (p_1, ..., p_n) where p_i = e^{x_i} / Σ_{j=1..n} e^{x_j}

1.6 Object detection

r Types of models – There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:

  Image classification: classifies a picture and predicts the probability of an object. Example: traditional CNN.
  Classification with localization: detects an object in a picture and predicts the probability of the object and where it is located. Example: simplified YOLO, R-CNN.
  Detection: detects up to several objects in a picture and predicts the probabilities of objects and where they are located. Example: YOLO, R-CNN.

r Detection – In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:
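A minimal NumPy sketch of the softmax function defined above (the score values are arbitrary examples; the max-shift is a standard numerical-stability trick, not part of the definition):

    import numpy as np

    def softmax(x):
        """Map a vector of scores to probabilities: p_i = exp(x_i) / sum_j exp(x_j)."""
        z = np.exp(x - np.max(x))  # shift by max(x) for numerical stability
        return z / z.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # -> probabilities summing to 1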

  Bounding box detection: detects the part of the image where the object is located. The box has center (b_x, b_y), height b_h and width b_w.
  Landmark detection: detects a shape or characteristics of an object (e.g. eyes); more granular. Reference points: (l_1x, l_1y), ..., (l_nx, l_ny).

r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:

  • Step 1: Divide the input image into a G × G grid.
  • Step 2: For each grid cell, run a CNN that predicts y of the following form:

      y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, ..., c_p, ...]^T ∈ R^{G × G × k × (5 + p)}

    where the block (p_c, b_x, b_y, b_h, b_w, c_1, ..., c_p) is repeated k times, p_c is the probability of detecting an object, b_x, b_y, b_h, b_w are the properties of the detected bounding box, c_1, ..., c_p is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.

  • Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.

Remark: when p_c = 0, the network does not detect any object. In that case, the corresponding predictions b_x, ..., c_p have to be ignored.

r Intersection over Union – Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box B_p is over the actual bounding box B_a. It is defined as:

  IoU(B_p, B_a) = (B_p ∩ B_a) / (B_p ∪ B_a)

Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box B_p is considered as being reasonably good if IoU(B_p, B_a) > 0.5.

r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.

r Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:

  • Step 1: Pick the box with the largest prediction probability.
  • Step 2: Discard any box having an IoU > 0.5 with the previous box.

r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.

Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.
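A small sketch of an IoU computation for axis-aligned boxes given as (x1, y1, x2, y2) corners; the box format and sample values are assumptions made for illustration, not part of the cheat sheet:

    def iou(box_p, box_a):
        """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
        x1 = max(box_p[0], box_a[0]); y1 = max(box_p[1], box_a[1])
        x2 = min(box_p[2], box_a[2]); y2 = min(box_p[3], box_a[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        return inter / (area_p + area_a - inter)

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.14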

1.6.1 Face verification and recognition

r Types of models – Two main types of model are summed up in the table below:

  Face verification: Is this the correct person? One-to-one lookup.
  Face recognition: Is this one of the K persons in the database? One-to-many lookup.

r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1, image 2).

r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x^(i), the encoded output is often noted as f(x^(i)).

r Triplet loss – The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example belongs to another one. By calling α ∈ R+ the margin parameter, this loss is defined as follows:

  ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0)

1.6.2 Neural style transfer

r Motivation – The goal of neural style transfer is to generate an image G based on a given content C and a given style S.

r Activation – In a given layer l, the activation is noted a^[l] and is of dimensions n_H × n_w × n_c.

r Content cost function – The content cost function J_content(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:

  J_content(C,G) = (1/2) ||a^[l](C) − a^[l](G)||^2

r Style matrix – The style matrix G^[l] of a given layer l is a Gram matrix where each of its elements G^[l]_{kk'} quantifies how correlated the channels k and k' are. It is defined with respect to activations a^[l] as follows:

  G^[l]_{kk'} = Σ_{i=1..n_H} Σ_{j=1..n_w} a^[l]_{ijk} a^[l]_{ijk'}

Remark: the style matrices for the style image and the generated image are noted G^[l](S) and G^[l](G) respectively.

r Style cost function – The style cost function J_style(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:

  J_style^[l](S,G) = 1/(2 n_H n_w n_c)² ||G^[l](S) − G^[l](G)||_F² = 1/(2 n_H n_w n_c)² Σ_{k,k'=1..n_c} (G^[l](S)_{kk'} − G^[l](G)_{kk'})²

r Overall cost function – The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α, β, as follows:

  J(G) = α J_content(C,G) + β J_style(S,G)

Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.

1.6.3 Architectures using computational tricks

r Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output, which is then fed into the discriminative model, which aims at differentiating the generated and true images.
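A minimal sketch of the triplet loss on pre-computed embeddings, using squared Euclidean distance as d; the embedding vectors and margin value are illustrative assumptions:

    import numpy as np

    def triplet_loss(f_a, f_p, f_n, alpha=0.2):
        """max(d(A,P) - d(A,N) + alpha, 0) with d = squared Euclidean distance."""
        d_ap = np.sum((f_a - f_p) ** 2)
        d_an = np.sum((f_a - f_n) ** 2)
        return max(d_ap - d_an + alpha, 0.0)

    f_a, f_p, f_n = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
    print(triplet_loss(f_a, f_p, f_n))  # 0.0: anchor already much closer to the positive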

Remark: use cases using variants of GANs include text to image, music generation and synthesis.

r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:

  a^[l+2] = g(a^[l] + z^[l+2])

r Inception Network – This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance. In particular, it uses the 1 × 1 convolution trick to lower the burden of computation.

⋆ ⋆ ⋆

2 Recurrent Neural Networks

2.1 Overview

r Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. For each timestep t, the activation a^<t> and the output y^<t> are expressed as follows:

  a^<t> = g_1(W_aa a^<t−1> + W_ax x^<t> + b_a)   and   y^<t> = g_2(W_ya a^<t> + b_y)

where W_ax, W_aa, W_ya, b_a, b_y are coefficients that are shared temporally and g_1, g_2 are activation functions.

The pros and cons of a typical RNN architecture are summed up in the table below:

  Advantages: possibility of processing input of any length; model size not increasing with size of input; computation takes into account historical information; weights are shared across time.
  Drawbacks: computation being slow; difficulty of accessing information from a long time ago; cannot consider any future input for the current state.

r Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:
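A minimal NumPy sketch of the recurrent step just defined, choosing tanh and softmax as g1 and g2; all shapes, weight values and the rnn_step name are illustrative assumptions:

    import numpy as np

    def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
        """One RNN timestep: new hidden state a_t and output y_t."""
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        scores = Wya @ a_t + by
        y_t = np.exp(scores) / np.exp(scores).sum()   # softmax output
        return a_t, y_t

    n_x, n_a, n_y = 3, 5, 2
    rng = np.random.default_rng(0)
    a_t, y_t = rnn_step(rng.standard_normal(n_x), np.zeros(n_a),
                        rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)),
                        rng.standard_normal((n_y, n_a)), np.zeros(n_a), np.zeros(n_y))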

The table below sums up the different types of RNN architecture, with typical examples:

  One-to-one (Tx = Ty = 1): traditional neural network.
  One-to-many (Tx = 1, Ty > 1): music generation.
  Many-to-one (Tx > 1, Ty = 1): sentiment classification.
  Many-to-many (Tx = Ty): name entity recognition.
  Many-to-many (Tx ≠ Ty): machine translation.

r Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:

  L(ŷ, y) = Σ_{t=1..Ty} L(ŷ^<t>, y^<t>)

r Backpropagation through time – Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to the weight matrix W is expressed as follows:

  ∂L^(T)/∂W = Σ_{t=1..T} ∂L^(T)/∂W |_(t)

2.2 Handling long term dependencies

r Commonly used activation functions – The most common activation functions used in RNN modules are described below:

  Sigmoid: g(z) = 1 / (1 + e^(-z))
  Tanh:    g(z) = (e^z − e^(-z)) / (e^z + e^(-z))
  ReLU:    g(z) = max(0, z)

r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of a multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.

r Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.

r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:

  Γ = σ(W x^<t> + U a^<t−1> + b)

where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:
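A small sketch of gradient clipping by value (the clip threshold and gradient arrays are illustrative assumptions; clipping by global norm is another common variant):

    import numpy as np

    def clip_gradients(grads, max_value=5.0):
        """Cap every gradient entry to [-max_value, max_value] to avoid exploding gradients."""
        return [np.clip(g, -max_value, max_value) for g in grads]

    grads = [np.array([12.0, -0.3]), np.array([[0.5, -7.0]])]
    print(clip_gradients(grads))  # large entries are capped at ±5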

  Update gate Γu:    How much past should matter now?   Used in GRU, LSTM
  Relevance gate Γr: Drop previous information?          Used in GRU, LSTM
  Forget gate Γf:    Erase a cell or not?                Used in LSTM
  Output gate Γo:    How much to reveal of a cell?       Used in LSTM

r GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:

  c̃^<t>:  GRU: tanh(Wc [Γr ⋆ a^<t−1>, x^<t>] + bc)     LSTM: tanh(Wc [Γr ⋆ a^<t−1>, x^<t>] + bc)
  c^<t>:  GRU: Γu ⋆ c̃^<t> + (1 − Γu) ⋆ c^<t−1>          LSTM: Γu ⋆ c̃^<t> + Γf ⋆ c^<t−1>
  a^<t>:  GRU: c^<t>                                     LSTM: Γo ⋆ c^<t>

Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.

r Variants of RNNs – The other commonly used RNN architectures are the Bidirectional (BRNN) and Deep (DRNN) architectures.

2.3 Learning word representation

In this section, we note V the vocabulary and |V| its size.

2.3.1 Motivation and notations

r Representation techniques – The two main ways of representing words are summed up in the table below:

  1-hot representation: noted o_w; naive approach, no similarity information.
  Word embedding: noted e_w; takes into account word similarity.

r Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation o_w to its embedding e_w as follows:

  e_w = E o_w

Remark: learning the embedding matrix can be done using target/context likelihood models.

2.3.2 Word embeddings

r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θ_t a parameter associated with t, the probability P(t|c) is given by:

  P(t|c) = exp(θ_t^T e_c) / Σ_{j=1..|V|} exp(θ_j^T e_c)

Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.

r Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how likely a given context and a given target word are to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:

  P(y = 1 | c, t) = σ(θ_t^T e_c)

Remark: this method is less computationally expensive than the skip-gram model.

r GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each X_{i,j} denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:

  J(θ) = (1/2) Σ_{i,j=1..|V|} f(X_{ij}) (θ_i^T e_j + b_i + b'_j − log(X_{ij}))²

where f is a weighting function such that X_{i,j} = 0 =⇒ f(X_{i,j}) = 0. Given the symmetry that e and θ play in this model, the final word embedding e_w^(final) is given by:

  e_w^(final) = (e_w + θ_w) / 2

Remark: the individual components of the learned word embeddings are not necessarily interpretable.

2.4 Comparing words

r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:

  similarity = (w1 · w2) / (||w1|| ||w2||) = cos(θ)

Remark: θ is the angle between words w1 and w2.

r t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.

2.5 Language model

r Overview – A language model aims at estimating the probability of a sentence P(y).

r n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.

r Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better, and is defined as follows:

  PP = ( Π_{t=1..T} 1 / (Σ_{j=1..|V|} y_j^(t) · ŷ_j^(t)) )^(1/T)

Remark: PP is commonly used in t-SNE.

2.6 Machine translation

r Overview – A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:

  y = arg max_{y^<1>,...,y^<Ty>} P(y^<1>, ..., y^<Ty> | x)

r Beam search – It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.

  • Step 1: Find the top B likely words y^<1>.
  • Step 2: Compute the conditional probabilities y^<k> | x, y^<1>, ..., y^<k−1>.
  • Step 3: Keep the top B combinations x, y^<1>, ..., y^<k>.
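A minimal sketch of the cosine similarity between two word vectors (the example vectors are arbitrary):

    import numpy as np

    def cosine_similarity(w1, w2):
        """cos(theta) = (w1 · w2) / (||w1|| ||w2||)."""
        return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

    print(cosine_similarity(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.1])))  # ≈ 1: nearly parallel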

Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.

r Beam width – The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.

r Length normalization – In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:

  Objective = (1 / Ty^α) Σ_{t=1..Ty} log p(y^<t> | x, y^<1>, ..., y^<t−1>)

Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.

r Error analysis – When obtaining a predicted translation ŷ that is bad, one can wonder why we did not get a good translation y* by performing the following error analysis:

  Case P(y*|x) > P(ŷ|x): root cause is a faulty beam search; remedy: increase the beam width.
  Case P(y*|x) ≤ P(ŷ|x): root cause is a faulty RNN; remedies: try a different architecture, regularize, get more data.

r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:

  bleu score = exp( (1/n) Σ_{k=1..n} p_k )

where p_n is the bleu score on n-grams only, defined as follows:

  p_n = Σ_{n-gram ∈ ŷ} count_clip(n-gram) / Σ_{n-gram ∈ ŷ} count(n-gram)

Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.

2.7 Attention

r Attention model – This model allows an RNN to pay attention to specific parts of the input that are considered as being important, which improves the performance of the resulting model in practice. By noting α^<t,t'> the amount of attention that the output y^<t> should pay to the activation a^<t'> and c^<t> the context at time t, we have:

  c^<t> = Σ_{t'} α^<t,t'> a^<t'>   with   Σ_{t'} α^<t,t'> = 1

Remark: the attention scores are commonly used in image captioning and machine translation.

r Attention weight – The amount of attention that the output y^<t> should pay to the activation a^<t'> is given by α^<t,t'>, computed as follows:

  α^<t,t'> = exp(e^<t,t'>) / Σ_{t''=1..Tx} exp(e^<t,t''>)

Remark: computation complexity is quadratic with respect to Tx.

⋆ ⋆ ⋆
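A small NumPy sketch of turning scores e^<t,t'> into attention weights and a context vector; the score and activation arrays are random placeholders, not values from the cheat sheet:

    import numpy as np

    def attention_context(scores, activations):
        """alpha = softmax(scores); context = sum_t' alpha_t' * a<t'>."""
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                      # weights sum to 1
        return alpha, alpha @ activations         # context c<t>

    rng = np.random.default_rng(0)
    alpha, c_t = attention_context(rng.standard_normal(4), rng.standard_normal((4, 8)))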

3 Deep Learning Tips and Tricks

3.1 Data processing

r Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given an input image, here are the techniques that we can apply:

  Original: image without any modification.
  Flip: flipped with respect to an axis for which the meaning of the image is preserved.
  Rotation: rotation with a slight angle; simulates incorrect horizon calibration.
  Random crop: random focus on one part of the image; several random crops can be done in a row.
  Color shift: nuances of RGB are slightly changed; captures noise that can occur with light exposure.
  Noise addition: addition of noise; more tolerance to quality variation of inputs.
  Information loss: parts of the image are ignored; mimics potential loss of parts of the image.
  Contrast change: luminosity changes; controls difference in exposition due to time of day.

r Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i}. By noting μ_B, σ_B² the mean and variance of the batch that we want to correct, it is done as follows:

  x_i ← γ (x_i − μ_B) / sqrt(σ_B² + ε) + β

It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.

3.2 Training a neural network

3.2.1 Definitions

r Epoch – In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.

r Mini-batch gradient descent – During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities, nor on one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.

r Loss function – In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.

r Cross-entropy loss – In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:

  L(z,y) = −[ y log(z) + (1 − y) log(1 − z) ]

3.2.2 Finding optimal weights

r Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule. Using this method, each weight is updated with the rule:

  w ← w − α ∂L(z,y)/∂w

r Updating weights – In a neural network, weights are updated as follows:

  • Step 1: Take a batch of training data and perform forward propagation to compute the loss.
  • Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
  • Step 3: Use the gradients to update the weights of the network.
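To make the three-step update loop concrete, here is a minimal mini-batch gradient descent sketch for a linear model with squared loss; the synthetic data, batch size and learning rate are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((256, 3)), rng.standard_normal(256)
    w, lr, batch_size = np.zeros(3), 0.1, 32

    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        preds = xb @ w                              # Step 1: forward pass on the mini-batch
        grad = 2 * xb.T @ (preds - yb) / len(xb)    # Step 2: gradient of the squared loss
        w -= lr * grad                              # Step 3: update the weights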

3.3 Parameter tuning

3.3.1 Weights initialization

r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization enables initial weights that take into account characteristics that are unique to the architecture.

r Transfer learning – Training a deep learning model requires a lot of data and, more importantly, a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage them towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:

  Small training size: freeze all layers, train weights on the softmax.
  Medium training size: freeze most layers, train weights on the last layers and the softmax.
  Large training size: train weights on all layers and the softmax, initializing weights on the pre-trained ones.

3.3.2 Optimizing convergence

r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

r Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:

  Momentum: dampens oscillations; improvement to SGD; 2 parameters to tune. Updates: w ← w − α v_dw, b ← b − α v_db.
  RMSprop: Root Mean Square propagation; speeds up the learning algorithm by controlling oscillations. Updates: w ← w − α dw/√s_dw, b ← b − α db/√s_db.
  Adam: Adaptive Moment estimation; most popular method; 4 parameters to tune. Updates: w ← w − α v_dw/(√s_dw + ε), b ← b − α v_db/(√s_db + ε).

Remark: other methods include Adadelta, Adagrad and SGD.

3.4 Regularization

r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > 0. It forces the model to avoid relying too much on particular sets of features.

Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1 − p.

r Weight regularization – In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:

  LASSO: shrinks coefficients to 0; good for variable selection. Penalty: ... + λ||θ||_1, with λ ∈ R.
  Ridge: makes coefficients smaller. Penalty: ... + λ||θ||_2², with λ ∈ R.
  Elastic Net: tradeoff between variable selection and small coefficients. Penalty: ... + λ[(1 − α)||θ||_1 + α||θ||_2²], with λ ∈ R, α ∈ [0,1].
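A minimal sketch of (inverted) dropout applied to a layer's activations during training; the keep probability and activation array are illustrative assumptions:

    import numpy as np

    def dropout(a, keep_prob=0.8, rng=np.random.default_rng(0)):
        """Zero out neurons with probability 1 - keep_prob, rescale the rest ('inverted' dropout)."""
        mask = rng.random(a.shape) < keep_prob
        return a * mask / keep_prob

    a = np.ones((2, 5))
    print(dropout(a))  # roughly 20% of entries zeroed, the rest scaled by 1/0.8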

r Early stopping – This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.

3.5 Good practices

r Overfitting small batch – When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.

r Gradient checking – Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.

  Numerical gradient: df/dx(x) ≈ (f(x + h) − f(x − h)) / (2h). Expensive, as the loss has to be computed twice per dimension; used to verify the correctness of the analytical implementation; there is a trade-off in choosing h, not too small (numerical instability) nor too large (poor gradient approximation).
  Analytical gradient: df/dx(x) = f'(x). 'Exact' result; direct computation; used in the final implementation.
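A minimal numerical-vs-analytical gradient check for a simple function; f(x) = x², the step size h and the test point are illustrative assumptions:

    def numerical_grad(f, x, h=1e-5):
        """Centered finite difference: (f(x + h) - f(x - h)) / (2h)."""
        return (f(x + h) - f(x - h)) / (2 * h)

    f = lambda x: x ** 2          # analytical gradient: f'(x) = 2x
    x = 3.0
    print(numerical_grad(f, x), 2 * x)  # both ≈ 6.0 -> the implementation looks correct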

⋆ ⋆ ⋆

Cheat Sheet – Convolutional Neural Network

Convolutional Neural Network:
The data gets into the CNN through the input layer and passes through various hidden layers before getting to the output layer. The output of the network is compared to the actual labels in terms of loss or error. The partial derivatives of this loss w.r.t. the trainable weights are calculated, and the weights are updated through one of the various methods using backpropagation.

CNN Template:
Most of the commonly used hidden layers (not all) follow a pattern:

1. Layer function: Basic transforming function such as a convolutional or fully connected layer.
   a. Fully Connected: Linear functions between the input and the output.
   b. Convolutional Layers: These layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D) kernel/filter that moves across the input feature map, generating dot products with the overlapping region of the input feature map.
   c. Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer.
[Illustration: a fully connected layer (input nodes densely connected to output nodes via weights and biases) vs. a convolutional layer (a kernel sliding over an input map to produce an output map).]

2. Pooling: Non-trainable layer to change the size of the feature map.
   a. Max/Average Pooling: Decreases the spatial size of the input layer by selecting the maximum/average value in the receptive field defined by the kernel.
   b. UnPooling: A non-trainable layer used to increase the spatial size of the input layer by placing the input pixel at a certain index in the receptive field of the output defined by the kernel.
3. Normalization: Usually used just before the activation functions to limit the unbounded activation from increasing the output layer values too high.
   a. Local Response Normalization (LRN): A non-trainable layer that square-normalizes the pixel values in a feature map within a local neighborhood.
   b. Batch Normalization: A trainable approach to normalizing the data by learning scale and shift variables during training.
4. Activation: Introduces non-linearity so the CNN can efficiently map non-linear complex mappings.
   a. Non-parametric/Static functions: Linear, ReLU
   b. Parametric functions: ELU, tanh, sigmoid, Leaky ReLU
   c. Bounded functions: tanh, sigmoid
5. Loss function: Quantifies how far off the CNN prediction is from the actual labels.
   a. Regression Loss Functions: MAE, MSE, Huber loss
   b. Classification Loss Functions: Cross entropy, Hinge loss
[Plots of the common loss functions, whose formulas are:]

  MSE loss:           mse = (x − x̂)²
  MAE loss:           mae = |x − x̂|
  Huber loss:         ½(x − x̂)² if |x − x̂| < δ, else δ|x − x̂| − ½δ²
  Hinge loss:         max(0, 1 − x̂) if x = 1; max(0, 1 + x̂) if x = −1
  Cross entropy loss: −y log(p) − (1 − y) log(1 − p)
Source: https://www.cheatsheets.aqeel-anwar.com


Cheat Sheet – Famous CNNs

AlexNet – 2012
Why: AlexNet was born out of the need to improve the results of the ImageNet challenge.
What: The network consists of 5 Convolutional (CONV) layers and 3 Fully Connected (FC) layers. The activation used is the Rectified Linear Unit (ReLU).
How: Data augmentation is carried out to reduce over-fitting, and Local Response Normalization is used.

VGGNet – 2014
Why: VGGNet was born out of the need to reduce the number of parameters in the CONV layers and improve on training time.
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.).
How: The important point to note here is that all the conv kernels are of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.

ResNet – 2015
Why: Neural networks are notorious for not being able to find a simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where 'XX' denotes the number of layers. The most used ones are ResNet50 and ResNet101. Since the vanishing gradient problem was taken care of (more about it in the How part), CNNs started to get deeper and deeper.
How: The ResNet architecture makes use of shortcut connections to solve the vanishing gradient problem. The basic building block of ResNet is a Residual block that is repeated throughout the network.
[Figure 1: ResNet residual block – weight layers with a skip connection adding the input x to f(x). Figure 2: Inception block – parallel 1x1, 3x3 and 5x5 convolutions plus max pooling, merged by filter concatenation.]


Inception – 2014
Why: Larger kernels are preferred for more global features; on the other hand, smaller kernels provide good results in detecting area-specific features. For effective recognition of such variable-sized features, we need kernels of different sizes. That is what Inception does.
What: The Inception network architecture consists of several inception modules of the following structure. Each inception module consists of four operations in parallel: a 1x1 conv layer, a 3x3 conv layer, a 5x5 conv layer, and max pooling.
How: Inception increases the network space from which the best network is to be chosen via training. Each inception module can capture salient features at different levels.

Source: https://www.cheatsheets.aqeel-anwar.com


5. Reinforcement learning
Reinforcement Learning Cheat Sheet

Agent-Environment Interface
The Agent at each step t receives a representation of the environment's state, St ∈ S, and it selects an action At ∈ A(s). Then, as a consequence of its action, the agent receives a reward, Rt+1 ∈ R ⊆ R.

Policy
A policy is a mapping from a state to an action:

  π_t(a|s)    (1)

That is, the probability of selecting an action At = a if St = s.

Reward
The total reward is expressed as:

  Gt = Σ_{k=0..H} γ^k r_{t+k+1}    (2)

where γ is the discount factor and H is the horizon, which can be infinite.

Markov Decision Process
A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:
  - finite set of states: s ∈ S
  - finite set of actions: a ∈ A
  - state transition probabilities: p(s'|s, a) = Pr{St+1 = s' | St = s, At = a}
  - expected reward for state-action-next state: r(s', s, a) = E[Rt+1 | St+1 = s', St = s, At = a]    (3)

Value Function
The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:

  Vπ(s) = E[Gt | St = s]    (4)

Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following π.

Action-Value (Q) Function
We can also denote the expected reward for state-action pairs:

  qπ(s, a) = Eπ[Gt | St = s, At = a]    (6)

Optimal
The optimal value function:

  V∗(s) = max_π Vπ(s)    (5)

Similarly, the optimal action-value function:

  q∗(s, a) = max_π qπ(s, a)    (7)

Using this notation we can redefine V∗, equation (5), using q∗(s, a), equation (7):

  V∗(s) = max_{a ∈ A(s)} qπ∗(s, a)    (8)

Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must be equal to the expected return from the best action from that state.

Bellman Equation
An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them:

  Vπ(s) = Eπ[Gt | St = s]
        = Eπ[ Σ_{k≥0} γ^k Rt+k+1 | St = s ]
        = Eπ[ Rt+1 + γ Σ_{k≥0} γ^k Rt+k+2 | St = s ]
        = Σ_a π(a|s) Σ_{s', r} p(s', r|s, a) [ r + γ Vπ(s') ]    (9)

and similarly for the Q function:

  qπ(s, a) = Eπ[Gt | St = s, At = a]
           = Σ_{s', r} p(s', r|s, a) [ r + γ Eπ( Σ_{k≥0} γ^k Rt+k+2 | St+1 = s' ) ]
           = Σ_{s', r} p(s', r|s, a) [ r + γ Vπ(s') ]    (10)

Dynamic Programming
Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration
We can now find the optimal policy:

1. Initialisation
   V(s) ∈ R (e.g. V(s) = 0) and π(s) ∈ A for all s ∈ S; Δ ← 0

2. Policy Evaluation
   while Δ ≥ θ (a small positive number) do
     foreach s ∈ S do
       v ← V(s)
       V(s) ← Σ_a π(a|s) Σ_{s', r} p(s', r|s, a) [ r + γ V(s') ]
       Δ ← max(Δ, |v − V(s)|)

3. Policy Improvement
   policy-stable ← true
   foreach s ∈ S do
     old-action ← π(s)
     π(s) ← argmax_a Σ_{s', r} p(s', r|s, a) [ r + γ V(s') ]
     policy-stable ← (old-action = π(s))
   if policy-stable, return V ≈ V∗ and π ≈ π∗, else go to 2

Algorithm 1: Policy Iteration
Value Iteration
We can avoid waiting until V(s) has converged and instead do the policy improvement and a truncated policy evaluation step in one operation:

  Initialise V(s) ∈ R, e.g. V(s) = 0; Δ ← 0
  while Δ ≥ θ (a small positive number) do
    foreach s ∈ S do
      v ← V(s)
      V(s) ← max_a Σ_{s', r} p(s', r|s, a) [ r + γ V(s') ]
      Δ ← max(Δ, |v − V(s)|)
  output: deterministic policy π ≈ π∗ such that π(s) = argmax_a Σ_{s', r} p(s', r|s, a) [ r + γ V(s') ]

Algorithm 2: Value Iteration

Monte Carlo Methods
Monte Carlo (MC) is a Model Free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation:

  Initialise, for all s ∈ S, a ∈ A(s): Q(s, a) ← arbitrary, π(s) ← arbitrary, Returns(s, a) ← empty list
  while forever do
    Choose S0 ∈ S and A0 ∈ A(S0) such that all pairs have probability > 0
    Generate an episode starting at S0, A0 following π
    foreach pair s, a appearing in the episode do
      G ← return following the first occurrence of s, a
      Append G to Returns(s, a)
      Q(s, a) ← average(Returns(s, a))
    foreach s in the episode do
      π(s) ← argmax_a Q(s, a)

Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:

  V(St) ← V(St) + α [ Gt − V(St) ]    (11)

where α is the learning rate: how much we want to forget about past experiences.

Temporal Difference - Q Learning
Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted reward Gt from the episode with an estimation:

  V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]    (12)

Sarsa
Sarsa (State-action-reward-state-action) is an on-policy TD control. The update rule:

  Q(st, at) ← Q(st, at) + α [ rt + γ Q(st+1, at+1) − Q(st, at) ]

n-step Sarsa
Define the n-step Q-return:

  q^(n) = Rt+1 + γ Rt+2 + ... + γ^{n−1} Rt+n + γ^n Q(St+n)

n-step Sarsa updates Q(S, a) towards the n-step Q-return:

  Q(st, at) ← Q(st, at) + α [ q_t^(n) − Q(st, at) ]

Forward View Sarsa(λ)

  q_t^λ = (1 − λ) Σ_{n=1..∞} λ^{n−1} q_t^(n)

Forward-view Sarsa(λ): Q(st, at) ← Q(st, at) + α [ q_t^λ − Q(st, at) ]

  Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
  foreach episode ∈ episodes do
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    while s is not terminal do
      Take action a, observe r, s'
      Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
      Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
      s ← s'; a ← a'

Algorithm 4: Sarsa(λ)

Q Learning
The following algorithm gives a generic implementation:

  Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
  foreach episode ∈ episodes do
    while s is not terminal do
      Choose a from s using a policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
      s ← s'

Algorithm 5: Q Learning

Deep Q Learning
Created by DeepMind, Deep Q Learning, DQL, substitutes the Q function with a deep neural network called a Q-network. It also keeps track of some observations in a memory in order to use them to train the network:

  Li(θi) = E_{(s,a,r,s') ∼ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θi) )² ]    (13)
  (target: r + γ max_{a'} Q(s', a'; θ_{i−1}); prediction: Q(s, a; θi))

where θ are the weights of the network and U(D) is the experience replay history.

  Initialise replay memory D with capacity N; initialise Q(s, a) arbitrarily
  foreach episode ∈ episodes do
    while s is not terminal do
      With probability ε select a random action a ∈ A(s), otherwise select a = argmax_a Q(s, a; θ)
      Take action a, observe r, s'
      Store transition (s, a, r, s') in D
      Sample a random minibatch of transitions (sj, aj, rj, s'j) from D
      Set yj ← rj for terminal s'j, or yj ← rj + γ max_{a'} Q(s', a'; θ) for non-terminal s'j
      Perform a gradient descent step on (yj − Q(sj, aj; θ))²
      s ← s'

Algorithm 6: Deep Q Learning

Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/Reinforcement-Learning-Cheat-Sheet
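Both Sarsa and Q-learning above rely on an ε-greedy behaviour policy; here is a minimal sketch of that choice (the Q-table layout, action names and ε value are illustrative assumptions):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """With probability epsilon explore a random action, otherwise exploit argmax_a Q(s, a)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    Q = {(0, "left"): 0.2, (0, "right"): 0.7}
    print(epsilon_greedy(Q, 0, ["left", "right"]))  # usually "right"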
6. Programming languages
6.1 Python language
Alvaro Sebastian
Python 3 Beginner's Reference Cheat Sheet – http://www.sixthresearcher.com

Main data types
  boolean = True / False
  integer = 10
  float = 10.01
  string = "123abc"
  list = [value1, value2, ...]
  dictionary = {key1: value1, key2: value2, ...}

Numeric operators                 Comparison operators
  +  addition                       ==  equal
  -  subtraction                    !=  different
  *  multiplication                 >   higher
  /  division                       <   lower
  ** exponent                       >=  higher or equal
  %  modulus                        <=  lower or equal
  // floor division

Boolean operators                 Special characters
  and  logical AND                   #        comment
  or   logical OR                    \n       new line
  not  logical NOT                   \<char>  escape char

String operations
  string[i]    retrieves the character at position i
  string[-1]   retrieves the last character
  string[i:j]  retrieves the characters in the range i to j

List operations
  list = []    defines an empty list
  list[i] = x  stores x with index i
  list[i]      retrieves the item with index i
  list[-1]     retrieves the last item
  list[i:j]    retrieves the items in the range i to j
  del list[i]  removes the item with index i

List methods
  list.append(x)    adds x to the end of the list
  list.extend(L)    appends L to the end of the list
  list.insert(i,x)  inserts x at position i
  list.remove(x)    removes the first list item whose value is x
  list.pop(i)       removes the item at position i and returns its value
  list.clear()      removes all items from the list
  list.index(x)     returns the position of the first item whose value is x
  list.count(x)     counts how many times x appears in the list
  list.sort()       sorts list items
  list.reverse()    reverses list elements
  list.copy()       returns a copy of the list

Dictionary operations
  dict = {}    defines an empty dictionary
  dict[k] = x  stores x associated to key k
  dict[k]      retrieves the item with key k
  del dict[k]  removes the item with key k

Dictionary methods
  dict.keys()     returns a list of keys
  dict.values()   returns a list of values
  dict.items()    returns a list of pairs (key, value)
  dict.get(k)     returns the value associated to the key k
  dict.pop(k)     removes the item associated to the key k and returns its value
  dict.update(D)  adds the key-value pairs of D to the dictionary
  dict.clear()    removes all keys-values from the dictionary
  dict.copy()     returns a copy of the dictionary

String methods
  string.upper()       converts to uppercase
  string.lower()       converts to lowercase
  string.count(x)      counts how many times x appears
  string.find(x)       position of the first occurrence of x
  string.replace(x,y)  replaces x with y
  string.strip(x)      removes leading and trailing characters x (whitespace by default)
  string.split(x)      returns a list of values delimited by x
  string.join(L)       returns a string with the values of L joined by the string
  string.format(x)     returns a string that includes formatted x

Legend: x, y stand for any kind of data values, s for a string, n for a number, L for a list where i, j are list indexes, D stands for a dictionary and k is a dictionary key.

Built-in functions
  print(x, sep='y')  prints x objects separated by y
  input(s)           prints s and waits for an input that will be returned
  len(x)             returns the length of x (s, L or D)
  min(L)             returns the minimum value in L
  max(L)             returns the maximum value in L
  sum(L)             returns the sum of the values in L
  range(n1,n2,n)     returns a sequence of numbers from n1 to n2 in steps of n
  abs(n)             returns the absolute value of n
  round(n1,n)        returns the number n1 rounded to n digits
  type(x)            returns the type of x (string, float, list, dict ...)
  str(x)             converts x to a string
  list(x)            converts x to a list
  int(x)             converts x to an integer number
  float(x)           converts x to a float number
  help(x)            prints help about x
  map(function, L)   applies function to the values in L

Conditional statements
  if <condition>:
      <code>
  elif <condition>:
      <code>
  else:
      <code>

  if <value> in <list>:

Loops
  while <condition>:
      <code>

  for <variable> in <list>:
      <code>

  for <variable> in range(start,stop,step):
      <code>

  for key, value in dict.items():
      <code>

Loop control statements
  break     finishes loop execution
  continue  jumps to the next iteration
  pass      does nothing

Functions
  def function(<params>):
      <code>
      return <data>

Modules
  import module
  module.function()

  from module import *
  function()

Data validation
  try:
      <code>
  except <error>:
      <code>
  else:
      <code>

Reading and writing files
  f = open(<path>, 'r')
  f.read(<size>)
  f.readline(<size>)
  f.close()

  f = open(<path>, 'r')
  for line in f:
      <code>
  f.close()

  f = open(<path>, 'w')
  f.write(<str>)
  f.close()

Working with files and folders
  import os
  os.getcwd()
  os.makedirs(<path>)
  os.chdir(<path>)
  os.listdir(<path>)

Running external programs
  import os
  os.system(<command>)

Legend: x, y stand for any kind of data values, s for a string, n for a number, L for a list where i, j are list indexes, D stands for a dictionary and k is a dictionary key.
Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries
>>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> my_array.corrcoef() Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces
Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file:
pickled_data = pickle.load(file)
>>> df_sheet2 = data.parse('1960-1966',
Learn Python for data science Interactively at www.DataCamp.com skiprows=[0],
names=['Country',
'AAM: War(2002)'])
>>> df_sheet1 = data.parse(0, HDF5 Files
parse_cols=[0],
Importing Data in Python skiprows=[0], >>> import h5py
>>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
names=['Country'])
Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r')
your data: To access the sheet names, use the sheet_names attribute:
>>> import numpy as np >>> data.sheet_names
>>> import pandas as pd Matlab Files
Help SAS Files >>> import scipy.io
>>> filename = 'workspace.mat'
>>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename)
>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file:
df_sas = file.to_data_frame()

Text Files Exploring Dictionaries


Stata Files Accessing Elements with Functions
Plain Text Files >>> data = pd.read_stata('urbanpop.dta') >>> print(mat.keys()) Print dictionary keys
>>> filename = 'huck_finn.txt' >>> for key in data.keys(): Print dictionary keys
>>> file = open(filename, mode='r') Open the file for reading print(key)
>>> text = file.read() Read a file’s contents Relational Databases meta
quality
>>> print(file.closed) Check whether file is closed
>>> from sqlalchemy import create_engine strain
>>> file.close() Close file
>>> print(text) >>> engine = create_engine('sqlite://Northwind.sqlite') >>> pickled_data.values() Return dictionary values
>>> print(mat.items()) Returns items in list format of (key, value)
Use the table_names() method to fetch a list of table names: tuple pairs
Using the context manager with
>>> with open('huck_finn.txt', 'r') as file:
>>> table_names = engine.table_names() Accessing Data Items with Keys
print(file.readline()) Read a single line
print(file.readline()) Querying Relational Databases >>> for key in data ['meta'].keys() Explore the HDF5 structure
print(file.readline()) print(key)
>>> con = engine.connect() Description
>>> rs = con.execute("SELECT * FROM Orders") DescriptionURL
Table Data: Flat Files >>> df = pd.DataFrame(rs.fetchall()) Detector
>>> df.columns = rs.keys() Duration
GPSstart
Importing Flat Files with numpy >>> con.close()
Observatory
Files with one data type Using the context manager with Type
UTCstart
>>> filename = ‘mnist.txt’ >>> with engine.connect() as con:
>>> print(data['meta']['Description'].value) Retrieve the value for a key
>>> data = np.loadtxt(filename, rs = con.execute("SELECT OrderID FROM Orders")
delimiter=',', String used to separate values df = pd.DataFrame(rs.fetchmany(size=5))
df.columns = rs.keys()
skiprows=2,
usecols=[0,2],
Skip the first 2 lines
Read the 1st and 3rd column
Navigating Your FileSystem
dtype=str) The type of the resulting array Querying relational databases with pandas
Magic Commands
Files with mixed data types >>> df = pd.read_sql_query("SELECT * FROM Orders", engine)
>>> filename = 'titanic.csv' !ls List directory contents of files and directories
>>> data = np.genfromtxt(filename, %cd .. Change current working directory
%pwd Return the current working directory path
delimiter=',',
names=True, Look for column header
Exploring Your Data
dtype=None)
NumPy Arrays os Library
>>> data_array = np.recfromcsv(filename) >>> data_array.dtype Data type of array elements >>> import os
>>> data_array.shape Array dimensions >>> path = "/usr/tmp"
The default dtype of the np.recfromcsv() function is None. >>> wd = os.getcwd() Store the name of current directory in a string
>>> len(data_array) Length of array
>>> os.listdir(wd) Output contents of the directory in a list
Importing Flat Files with pandas >>> os.chdir(path) Change current working directory
pandas DataFrames >>> os.rename("test1.txt", Rename a file
>>> filename = 'winequality-red.csv' "test2.txt")
>>> data = pd.read_csv(filename, >>> df.head() Return first DataFrame rows
nrows=5, >>> os.remove("test1.txt") Delete an existing file
Number of rows of file to read >>> df.tail() Return last DataFrame rows >>> os.mkdir("newdir") Create a new directory
header=None, Row number to use as col names >>> df.index Describe index
sep='\t', Delimiter to use >>> df.columns Describe DataFrame columns
comment='#', Character to split comments >>> df.info() Info on DataFrame
na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to a NumPy array
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays >>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com
Getting
>>> s['b'] Get one element Sort & Rank
-5
Pandas >>> df.sort_index() Sort by labels along an axis
>>> df.sort_values(by='Country') Sort by the values along an axis
>>> df[1:] Get subset of a DataFrame
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035
2 Brazil Brasília 207847528
programming language. Retrieving Series/DataFrame Information
Selecting, Boolean Indexing & Setting Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[[0],[0]] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.iat[0, 0]
>>>
>>>
df.info()
df.count()
Info on DataFrame
Number of non-NA values
Series 'Belgium'
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[[0], ['Country']] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5
'Belgium' column labels >>> df.cumsum() Cumulative sum of values
>>> df.min()/df.max() Minimum/maximum values
c 7 >>> df.at[0, 'Country'] >>> df.idxmin()/df.idxmax()
Index Minimum/Maximum index value
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
By Label/Position >>> df.median() Median of values
>>> df.ix[2] Select single row of
DataFrame Country
Capital
Brazil
Brasília
subset of rows Applying Functions
Population 207847528 >>> f = lambda x: x*2
Columns
Country Capital Population A two-dimensional labeled >>> df.ix[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index >>> df.ix[1,'Capital'] Select rows and columns
2 Brazil Brasília 207847528 Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data, Setting
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0

I/O Arithmetic Operations with Fill Methods


You can also do the internal data alignment yourself with
Read and Write to CSV Read and Write to SQL Query or Database Table
the help of the fill methods:
>>> pd.read_csv('file.csv', header=None, nrows=5) >>> from sqlalchemy import create_engine >>> s.add(s3, fill_value=0)
>>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:') a 10.0
>>> pd.read_sql("SELECT * FROM my_table;", engine) b -5.0
Read and Write to Excel c 5.0
>>> pd.read_sql_table('my_table', engine) d 7.0
>>> pd.read_excel('file.xlsx') >>> pd.read_sql_query("SELECT * FROM my_table;", engine) >>> s.sub(s3, fill_value=2)
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1') >>> s.div(s3, fill_value=4)
read_sql()is a convenience wrapper around read_sql_table() and
Read multiple sheets from the same file >>> s.mul(s3, fill_value=3)
read_sql_query()
>>> xlsx = pd.ExcelFile('file.xls')
>>> df = pd.read_excel(xlsx, 'Sheet1') >>> df.to_sql('myDf', engine)
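Putting the basics above together, a minimal runnable sketch using the values shown on this sheet:
import pandas as pd
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
print(s['b'])                               # -5, selection by label
print(df.iloc[0, 0])                        # 'Belgium', selection by position
print(df[df['Population'] > 1200000000])    # boolean filter
df.to_csv('myDataFrame.csv')                # write out, as in the I/O section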
Python For Data Science Cheat Sheet Advanced Indexing Also see NumPy Arrays Combining Data
Selecting data1 data2
Pandas >>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1 X2 X1 X3
Learn Python for Data Science Interactively at www.DataCamp.com >>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN a 11.432 a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Reshaping Data >>> df3.filter(items=["a", "b"]) Filter on values
Merge
>>> df.select(lambda x: not x%5) Select specific elements
Pivot Where X1 X2 X3
>>> pd.merge(data1,
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> s.where(s > 0) Subset the data data2, a 11.432 20.784
columns='Type', Query how='left',
values='Value') b 1.303 NaN
>>> df6.query('second > first') Query DataFrame on='X1')
c 99.906 NaN
Date Type Value

0 2016-03-01 a 11.432 Type a b c Setting/Resetting Index >>> pd.merge(data1, X1 X2 X3


1 2016-03-02 b 13.031 Date data2, a 11.432 20.784
>>> df.set_index('Country') Set the index
how='right',
2 2016-03-01 c 20.784 2016-03-01 11.432 NaN 20.784 >>> df4 = df.reset_index() Reset the index b 1.303 NaN
on='X1')
3 2016-03-03 a 99.906 >>> df = df.rename(index=str, Rename DataFrame d NaN 20.784
2016-03-02 1.303 13.031 NaN columns={"Country":"cntry",
4 2016-03-02 a 1.303 "Capital":"cptl", >>> pd.merge(data1,
2016-03-03 99.906 NaN 20.784 "Population":"ppltn"}) X1 X2 X3
5 2016-03-03 c 20.784 data2,
how='inner', a 11.432 20.784
Pivot Table Reindexing on='X1') b 1.303 NaN
>>> s2 = s.reindex(['a','c','d','e','b'])
>>> df4 = pd.pivot_table(df2, Spread rows into columns X1 X2 X3
values='Value', Forward Filling Backward Filling >>> pd.merge(data1,
index='Date', data2, a 11.432 20.784
columns='Type') >>> df.reindex(range(4), >>> s3 = s.reindex(range(5), how='outer', b 1.303 NaN
method='ffill') method='bfill') on='X1') c 99.906 NaN
Stack / Unstack Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> stacked = df5.stack() Pivot a level of column labels 1 India New Delhi 1303171035 2 3
>>> stacked.unstack() Pivot a level of index labels 2 Brazil Brasília 207847528 3 3 Join
3 Brazil Brasília 207847528 4 3
0 1 1 5 0 0.233482 >>> data1.join(data2, how='right')
1 5 0.233482 0.390959 1 0.390959 MultiIndexing Concatenate
2 4 0.184713 0.237102 2 4 0 0.184713
>>> arrays = [np.array([1,2,3]),
3 3 0.433522 0.429401 1 0.237102 np.array([5,4,3])] Vertical
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) >>> s.append(s2)
Unstacked 3 3 0 0.433522
>>> tuples = list(zip(*arrays)) Horizontal/Vertical
1 0.429401 >>> index = pd.MultiIndex.from_tuples(tuples, >>> pd.concat([s,s2],axis=1, keys=['One','Two'])
Stacked names=['first', 'second']) >>> pd.concat([data1, data2], axis=1, join='inner')
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Melt >>> df2.set_index(["Date", "Type"])
>>> pd.melt(df2, Gather columns into rows
Dates
id_vars=["Date"],
value_vars=["Type", "Value"],
Duplicate Data >>> df2['Date']= pd.to_datetime(df2['Date'])
>>> df2['Date']= pd.date_range('2000-1-1',
value_name="Observations") >>> s3.unique() Return unique values periods=6,
>>> df2.duplicated('Type') Check duplicates freq='M')
Date Type Value
Date Variable Observations >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
0 2016-03-01 Type a >>> df2.drop_duplicates('Type', keep='last') Drop duplicates >>> index = pd.DatetimeIndex(dates)
0 2016-03-01 a 11.432 1 2016-03-02 Type b
>>> df.index.duplicated() Check index duplicates >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
1 2016-03-02 b 13.031 2 2016-03-01 Type c
2 2016-03-01 c 20.784 3 2016-03-03 Type a Grouping Data Visualization Also see Matplotlib
4 2016-03-02 Type a
3 2016-03-03 a 99.906
5 2016-03-03 Type c Aggregation >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 >>> df2.groupby(by=['Date','Type']).mean()
6 2016-03-01 Value 11.432 >>> s.plot() >>> df2.plot()
>>> df4.groupby(level=0).sum()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031 >>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), >>> plt.show() >>> plt.show()
8 2016-03-01 Value 20.784 'b': np.sum})
9 2016-03-03 Value 99.906 Transformation
>>> customSum = lambda x: (x+x%2)
10 2016-03-02 Value 1.303
>>> df4.groupby(level=0).transform(customSum)
11 2016-03-03 Value 20.784

Iteration Missing Data


>>> df.dropna() Drop NaN values
>>> df.iteritems() (Column-index, Series) pairs >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
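A short sketch combining the reshaping, grouping and merging operations above; df2 follows the Date/Type/Value layout of the Pivot example, and data1/data2 are the small frames from the Combining Data panel:
import pandas as pd
df2 = pd.DataFrame({'Date': ['2016-03-01', '2016-03-02', '2016-03-01',
                             '2016-03-03', '2016-03-02', '2016-03-03'],
                    'Type': ['a', 'b', 'c', 'a', 'a', 'c'],
                    'Value': [11.432, 13.031, 20.784, 99.906, 1.303, 20.784]})
df3 = df2.pivot(index='Date', columns='Type', values='Value')     # spread rows into columns
melted = pd.melt(df2, id_vars=['Date'], value_name='Observations') # gather columns into rows
grouped = df2.groupby(by=['Date', 'Type']).mean()                  # aggregation
data1 = pd.DataFrame({'X1': ['a', 'b', 'c'], 'X2': [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({'X1': ['a', 'b', 'd'], 'X3': [20.784, None, 20.784]})
merged = pd.merge(data1, data2, how='left', on='X1')               # left join on X1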
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics >>>
>>>
len(a)
b.ndim
Length of array
Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 1 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2]
array([1, 2])
1 2 3 Select items at index 0 and 1
2
The NumPy library is the core library for scientific computing in Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
>>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
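A minimal sketch using the import convention above and the arrays defined in the Creating Arrays panel:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([(1.5, 2, 3), (4, 5, 6)], dtype=float)
print(a.shape, b.shape)   # (3,) (2, 3)
print(b.dtype)            # float64
print(a + b)              # broadcasting: a is added to each row of b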
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])

>>> np.subtract(a,b) Boolean Indexing


1D array 2D array 3D array Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
Creating Arrays >>> a * b
array([[ 1.5, 4. , 9. ],
Multiplication
[ 4. , 5.
[ 1.5, 2.
,
,
6.
3.
,
,
4. ],
1.5]])

>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])


>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O array([ 1, 2,
>>> np.vstack((a,b))
3, 10, 15, 20])
Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median >>> np.column_stack((a,d)) Create stacked column-wise arrays
>>> np.loadtxt("myfile.txt") >>> np.corrcoef(a) Correlation coefficient array([[ 1, 10],
>>> np.std(b) Standard deviation [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
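A hedged round-trip sketch of the on-disk and text I/O calls above; file names are placeholders, and np.save appends the .npy extension automatically:
import numpy as np
a = np.array([1, 2, 3])
np.save('my_array', a)                        # binary .npy file
restored = np.load('my_array.npy')
np.savetxt('myarray.txt', a, delimiter=' ')   # plain-text file
from_text = np.loadtxt('myarray.txt')         # read back as floats
print(np.array_equal(a, restored), np.allclose(a, from_text))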
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard single-precision floating point >>> h = a.copy() Create a deep copy of the array
>>> np.complex Complex numbers represented by two 64-bit floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 3.],
>>>
>>>
np.bool
np.object
Boolean type storing TRUE and FALSE values
Python object type Sorting Arrays [ 4., 5., 6.]]])]

>>> np.string_ Fixed-length string type >>> a.sort() Sort an array


>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')

1 Prepare The Data Also see Lists & NumPy


>>> plt.show() Step 6

1D Data 4 Customize Plot


>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot >>>
>>>
plt.plot(x,y,linewidth=4.0)
plt.plot(x,y,ls='solid')
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements

>>> import matplotlib.pyplot as plt >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks


>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward
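A small, self-contained sketch of the figure/axes workflow described above (the data is made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(nrows=2, ncols=2)   # a grid of Axes, as above
axes[0, 0].plot(x, np.cos(x), c='k')
axes[0, 1].scatter(x, np.sin(x), marker='.')
axes[1, 0].hist(np.sin(x))
axes[1, 1].axhline(0.45)
fig.tight_layout()                           # fit subplots into the figure area
plt.savefig('foo.png')
plt.show()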

3 Plotting Routines 5 Save Plot


1D Data Vector Fields Save figures
>>> plt.savefig('foo.png')
>>> fig, ax = plt.subplots() >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[0,1].streamplot(X,Y,U,V) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) Draw a vertical line across axes
Data Distributions
>>> ax1.hist(y) Plot a histogram
6 Show Plot
>>> plt.show()
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array >>> plt.clf() Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data1) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot
Matplotlib 2.0.0 - Updated on: 02/2017
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')

Categorical Plots Regression Plots


Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt y="sepal_length", model fit
>>> sns.stripplot(x="species", Scatterplot with one
>>> import seaborn as sns data=iris,
y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals with Matrix Plots
4. Further customize your plot hue="class", scatterplot glyphs
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>>
>>>
>>>
import seaborn as sns
tips = sns.load_dataset("tips")
sns.set_style("whitegrid") Step 2
Step 1
>>> sns.countplot(x="deck",
data=titanic,
Show count of observations
4 Further Customizations Also see Matplotlib
palette="Greens_d")
>>> g = sns.lmplot(x="tip", Step 3
Point Plot Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals as >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", rectangular bars >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show(g) Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],

1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)
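A hedged end-to-end sketch using the built-in data sets loaded above; the axis labels are chosen for illustration:
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset("iris")
sns.set_style("whitegrid")                       # figure aesthetics
g = sns.lmplot(x="sepal_width", y="sepal_length",
               hue="species", data=iris)         # regression fit per species
g.set_axis_labels("Sepal width", "Sepal length")
plt.show()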

2 Figure Aesthetics Also see Matplotlib


5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
>>> sns.axes_style("whitegrid") Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet 3 Renderers & Visual Customizations
Bokeh Glyphs Grid Layout
Learn Bokeh Interactively at www.DataCamp.com, Scatter Markers >>> from bokeh.layouts import gridplot
taught by Bryan Van de Ven, core contributor >>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> row1 = [p1,p2]
fill_color='white') >>> row2 = [p3]
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], >>> layout = gridplot([[p1,p2],[p3]])
color='blue', size=1)
Plotting With Bokeh Line Glyphs Tabbed Layout
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]), >>> from bokeh.models.widgets import Panel, Tabs
The Python interactive visualization library Bokeh >>> tab1 = Panel(child=p1, title="tab1")
pd.DataFrame([[3,4,5],[3,2,1]]),
enables high-performance visual presentation of color="blue") >>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
large datasets in modern web browsers.
Customized Glyphs Also see Data
Linked Plots
Bokeh’s mid-level general purpose bokeh.plotting Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select') Linked Axes
interface is centered around two main components: data >>> p.circle('mpg', 'cyl', source=cds_df, >>> p2.x_range = p1.x_range
and glyphs. selection_color='red', >>> p2.y_range = p1.y_range
nonselection_alpha=0.1) Linked Brushing
>>> p4 = figure(plot_width = 100,
+ = Hover Glyphs tools='box_select,lasso_select')
>>> from bokeh.models import HoverTool
>>> p4.circle('mpg', 'cyl', source=cds_df)
data glyphs plot >>> hover = HoverTool(tooltips=None, mode='vline')
>>> p5 = figure(plot_width = 200,
>>> p3.add_tools(hover)
tools='box_select,lasso_select')
The basic steps to creating plots with the bokeh.plotting >>> p5.circle('mpg', 'hp', source=cds_df)
interface are: US
Colormapping >>> layout = row(p4,p5)
1. Prepare some data: >>> from bokeh.models import CategoricalColorMapper
Asia
Europe

Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
>>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'],
palette=['blue', 'red', 'green'])
4 Output & Export
3. Add renderers for your data, with visual customizations >>> p3.circle('mpg', 'cyl', source=cds_df, Notebook
color=dict(field='origin',
4. Specify where to generate the output transform=color_mapper), >>> from bokeh.io import output_notebook, show
5. Show or save the results legend='Origin') >>> output_notebook()
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Legend Location HTML
>>> x = [1, 2, 3, 4, 5] Step 1
>>> y = [6, 7, 2, 4, 5] Inside Plot Area Standalone HTML
>>> p = figure(title="simple line example", Step 2 >>> p.legend.location = 'bottom_left' >>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
x_axis_label='x',
>>> html = file_html(p, CDN, "my_plot")
y_axis_label='y') Outside Plot Area
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 >>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1])) >>> from bokeh.io import output_file, show
>>> output_file("lines.html") Step 4 >>> r2 = p2.line([1,2,3,4], [3,4,5,6]) >>> output_file('my_bar_chart.html', mode='cdn')
>>> show(p) Step 5 >>> legend = Legend(items=[("One" ,[p1, r1]),("Two",[r2])],
location=(0, -30)) Components
1 Data Also see Lists, NumPy & Pandas
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> from bokeh.embed import components
>>> script, div = components(p)
Under the hood, your data is converted to Column Data
Sources. You can also do this manually: >>> p.legend.orientation = "horizontal" PNG
>>> import numpy as np >>> p.legend.orientation = "vertical"
>>> from bokeh.io import export_png
>>> import pandas as pd >>> export_png(p, filename="plot.png")
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Legend Background & Border
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]), >>> p.legend.border_line_color = "navy" SVG
columns=['mpg','cyl', 'hp', 'origin'], >>> p.legend.background_fill_color = "white"
index=['Toyota', 'Fiat', 'Volvo']) >>> from bokeh.io import export_svgs
>>> from bokeh.models import ColumnDataSource Rows & Columns Layout >>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
>>> cds_df = ColumnDataSource(df) Rows
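A short sketch tying the data step above to a renderer; it reuses the cds_df built from df, and the size, colour and output file name are chosen for illustration:
from bokeh.plotting import figure
from bokeh.io import output_file, show
p = figure(plot_width=300, plot_height=300)
p.circle('mpg', 'hp', source=cds_df, size=10, color='navy')  # column names from the DataFrame above
output_file('cds_example.html')
show(p)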
>>> from bokeh.layouts import row

2 Plotting >>> layout = row(p1,p2,p3)


Columns
5 Show or Save Your Plots
>>> from bokeh.plotting import figure >>> from bokeh.layouts import column >>> show(p1) >>> show(layout)
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') >>> layout = column(p1,p2,p3) >>> save(p1) >>> save(layout)
>>> p2 = figure(plot_width=300, plot_height=300, Nesting Rows & Columns
x_range=(0, 8), y_range=(0, 8)) >>> layout = row(column(p1,p2), p3)
>>> p3 = figure()
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
SciPy >>> B = np.asmatrix(b) Subtraction
>>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
scientific computing that provides mathematical >>> np.divide(A,D) Division
Basic Matrix Routines Multiplication
algorithms and convenience functions built on the
>>> np.multiply(D,A) Multiplication
NumPy extension of Python. Inverse >>> np.dot(A,D) Dot product
>>> A.I Inverse >>> np.vdot(A,D) Vector dot product
>>> linalg.inv(A) Inverse
Interacting With NumPy Also see NumPy >>> A.T Transpose matrix >>> np.inner(A,D) Inner product
>>> np.outer(A,D) Outer product
>>> import numpy as np >>> A.H Conjugate transposition >>> np.tensordot(A,D) Tensor dot product
>>> a = np.array([1,2,3]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) Norm Exponential Functions
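A minimal sketch of solving a dense linear system with scipy.linalg, mirroring the 2x2 matrix D defined under Creating Matrices; the right-hand side vector is made up for illustration:
import numpy as np
from scipy import linalg
A = np.array([[3., 4.], [5., 6.]])
rhs = np.array([7., 8.])
x = linalg.solve(A, rhs)          # solver for dense matrices
print(np.allclose(A @ x, rhs))    # True: the solution reproduces the right-hand side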
>>> linalg.norm(A) Frobenius norm >>> linalg.expm(A) Matrix exponential
Index Tricks >>> linalg.norm(A,1) L1 norm (max column sum) >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,np.inf) L inf norm (max row sum) >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> np.mgrid[0:5,0:5] Create a dense meshgrid decomposition)
>>> np.ogrid[0:2,0:2] Create an open meshgrid Rank Logarithm Function
>>> np.r_[[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise) >>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
>>> np.c_[b,c] Create stacked column-wise arrays Determinant Trigonometric Functions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
Shape Manipulation Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> np.transpose(b) Permute array dimensions >>> linalg.solve(A,b) Solver for dense matrices >>> linalg.tanm(A) Matrix tangent
>>> b.flatten() Flatten the array >>> E = np.mat(a).T Solver for dense matrices Hyperbolic Trigonometric Functions
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.lstsq(D,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hyperbolic matrix sine
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) equation >>> linalg.coshm(D) Hyperbolic matrix cosine
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
>>> np.vsplit(d,2) Split the array vertically at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
(least-squares solver) >>> linalg.signm(A) Matrix sign function
Polynomials >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix
>>> from numpy import poly1d (SVD) Matrix Square Root
>>> linalg.sqrtm(A) Matrix square root
>>> p = poly1d([3,4,5]) Create a polynomial object
Creating Sparse Matrices Arbitrary Functions
Vectorizing Functions >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
>>> F = np.eye(3, k=1) Create a 3x3 matrix with ones on the first superdiagonal
>>> def myfunc(a):
if a < 0: >>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a*2 >>> C[C > 0.5] = 0
else: >>> H = sparse.csr_matrix(C)
return a/2
Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> np.vectorize(myfunc) Vectorize functions >>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
>>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
Type Handling >>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> v[:,1] Second eigenvector
>>> np.real(c) Return the real part of the array elements
>>> np.imag(c) Return the imaginary part of the array elements Sparse Matrix Routines >>> linalg.eigvals(A) Compute eigenvalues
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Singular Value Decomposition
>>> np.cast['f'](np.pi) Cast object to a data type Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> sparse.linalg.norm(I) Norm LU Decomposition
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> P,L,U = linalg.lu(C) LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values
Solving linear problems
(number of samples) >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> g [3:] += np.pi
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale) Sparse Matrix Functions
>>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
conditions
>>> misc.factorial(a) Factorial
>>> misc.comb(10,3,exact=True) Combine N things taken k at a time
>>> misc.central_diff_weights(3) Weights for Np-point central derivative
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point
Asking For Help
>>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix)
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> metrics.v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.model_selection import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> y_pred = lr.predict(X_test)
>>> print(cross_val_score(lr, X, y, cv=2))
Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X)
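The estimator/fit/predict pattern above can also be chained; a hedged sketch using a Pipeline, which is not covered on this sheet, so treat it as an optional extension:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=33)
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)                 # scaling and fitting in one step
print(pipe.score(X_test, y_test))          # accuracy on held-out data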
Python For Data Science Cheat Sheet Model Architecture Inspect Model
>>> model.output_shape
Sequential Model Model output shape
Keras >>> from keras.models import Sequential
>>>
>>>
model.summary()
model.get_config()
Model summary representation
Model configuration
Learn Python for data science Interactively at www.DataCamp.com >>> model = Sequential() >>> model.get_weights() List all weight tensors in the model
>>> model2 = Sequential()
>>> model3 = Sequential() Compile Model
Multilayer Perceptron (MLP) MLP: Binary Classification
Keras Binary Classification >>> model.compile(optimizer='adam',
loss='binary_crossentropy',
Keras is a powerful and easy-to-use deep learning library for >>> from keras.layers import Dense metrics=['accuracy'])
Theano and TensorFlow that provides a high-level neural >>> model.add(Dense(12, MLP: Multi-Class Classification
input_dim=8, >>> model.compile(optimizer='rmsprop',
networks API to develop and evaluate deep learning models. kernel_initializer='uniform', loss='categorical_crossentropy',
activation='relu')) metrics=['accuracy'])
A Basic Example >>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
MLP: Regression
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid')) >>> model.compile(optimizer='rmsprop',
>>> import numpy as np loss='mse',
>>> from keras.models import Sequential Multi-Class Classification metrics=['mae'])
>>> from keras.layers import Dense >>> from keras.layers import Dropout
>>> data = np.random.random((1000,100)) >>> model.add(Dense(512,activation='relu',input_shape=(784,))) Recurrent Neural Network
>>> labels = np.random.randint(2,size=(1000,1)) >>> model.add(Dropout(0.2)) >>> model3.compile(loss='binary_crossentropy',
>>> model = Sequential() optimizer='adam',
>>> model.add(Dense(512,activation='relu')) metrics=['accuracy'])
>>> model.add(Dense(32, >>> model.add(Dropout(0.2))
activation='relu', >>> model.add(Dense(10,activation='softmax'))

>>>
input_dim=100))
model.add(Dense(1, activation='sigmoid'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ide- >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
ally, you split the data in training and test sets, for which you can also resort batch_size=32)
>>> model2.add(Dropout(0.25))
to the train_test_split function of sklearn.model_selection.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(64,(3, 3)))
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>>
>>>
model2.add(Flatten())
model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('my_model.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.layers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2)
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2)
Python For Data Science Cheat Sheet
Jupyter Notebook
Learn More Python for Data Science Interactively at www.DataCamp.com
Working with Different Programming Languages
Kernels provide computation and communication with front-end interfaces like the notebooks. There are three main kernels: IRkernel, IJulia and IPython. Installing Jupyter Notebook will automatically install the IPython kernel.
Widgets
Notebook widgets provide the ability to visualize and control changes in your data, often as a control like a slider, textbox, etc. You can use them to build interactive GUIs for your notebooks or to synchronize stateful and stateless information between Python and JavaScript.
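A minimal widget sketch, assuming the ipywidgets package is installed and the code is run inside a notebook cell:
from ipywidgets import interact

def show_square(x):
    # re-runs whenever the slider moves
    print(x, 'squared is', x ** 2)

interact(show_square, x=(0, 10))   # renders an integer slider bound to x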
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF

Writing Code And Text


Code and text are encapsulated by 3 basic cell types: markdown cells, code
cells, and raw NBConvert cells.
Edit Cells Edit Mode: 1. Save and checkpoint 9. Interrupt kernel
2. Insert cell below 10. Restart kernel
3. Cut cell 11. Display characteristics
Cut currently selected cells Copy cells from 4. Copy cell(s) 12. Open command palette
to clipboard clipboard to current 5. Paste cell(s) below 13. Current kernel
cursor position 6. Move cell up 14. Kernel status
Paste cells from Executing Cells 7. Move cell down 15. Log out from notebook server
clipboard above Paste cells from 8. Run current cell
current cell Run selected cell(s) Run current cells down
clipboard below
and create a new one
Paste cells from current cell
below Asking For Help
clipboard on top Run current cells down
Delete current cells
of current cel and create a new one Walk through a UI tour
Split up a cell from above Run all cells
Revert “Delete Cells” List of built-in keyboard
current cursor Run all cells above the Run all cells below
invocation shortcuts
position current cell the current cell Edit the built-in
Merge current cell Merge current cell keyboard shortcuts
Change the cell type of toggle, toggle Notebook help topics
with the one above with the one below current cell scrolling and clear Description of
Move current cell up Move current cell toggle, toggle current outputs markdown available Information on
down scrolling and clear in notebook unofficial Jupyter
Adjust metadata
underlying the Find and replace all output Notebook extensions
Python help topics
current notebook in selected cells IPython help topics
View Cells
Remove cell Copy attachments of NumPy help topics
attachments current cell Toggle display of Jupyter SciPy help topics
Toggle display of toolbar Matplotlib help topics
Paste attachments of Insert image in logo and filename
SymPy help topics
current cell selected cells Toggle display of cell Pandas help topics
action icons:
Insert Cells - None About Jupyter Notebook
- Edit metadata
Toggle line numbers - Raw cell format
Add new cell above the Add new cell below the - Slideshow
current one in cells - Attachments
current one
- Tags
​ Jupyter Notebook Markdown Cheatsheet​ ​ From ​SqlBak.com​ with 💙
<a id="anchor"></a>
#⌴Header 1 [Go to anchor](#anchor)
Header 1 #⌴Top Header
Go to anchor
Header 1
======== [Go to header](#Top-Header)

##⌴Header 2 https://sqlbak.com
Header 2
Header 2 [Link](https://sqlbak.com
-------- "optional title") Link
Click [here][id]
###⌴Header 3 Header 3 [id]:https://sqlbak.com

####⌴Header 4 Header 4 > blockquote text


blockquote text
#####⌴Header 5 Header 5
```python
*italics* italics print('hello');
_italics_ ``` print('hello');
\*literal asterisks\* *literal asterisks* `inline_code();`

**bold** bold |Left |Center|Right|


__bold__ |:-----|:----:|----:|
|1 |A |C | Left Center Right
~~strikethrough~~ strikethrough |2 |B |D |
1 A C
1.⌴First item 1. First item
2. Second item 2 B D
2.⌴Second item
⌴1.⌴Subitem A. Subitem

*⌴Item 1 ● Item 1 ![alt text](logo.png "Title")


⌴Indent Indent
-⌴Item 2 ● Item 2 ![][id]
⌴+⌴Item 3 ■ Item 3 [id]:logo.png "Title"

- [x] Done ● ☑ Done $$\sqrt{k}$$ √k


- [ ] To do ● ▯To do Inline: $\sqrt{k}$

A A Line [![Img Alt


Line⌴⌴
Break
Break Text](http://img.youtube.com/vi/
aZCXOw707nc​/0.jpg)](https://yout
u.be/​aZCXOw707nc​ "Video Title")
--- ___________________________
* * * _
Natural Language Processing with Python & nltk Cheat Sheet
by RJ Murray (murenei) via cheatography.com/58736/cs/15485/

Handling Text
- text='Some words' - assign a string
- list(text) - split text into character tokens
- set(text) - unique tokens
- len(text) - number of characters

Accessing corpora and lexical resources
- from nltk.corpus import brown - import a CorpusReader object
- brown.words(text_id) - returns a pretokenised document as a list of words
- brown.fileids() - lists documents in the Brown corpus
- brown.categories() - lists categories in the Brown corpus

Tokenization
- text.split(" ") - split by space
- nltk.word_tokenize(text) - nltk built-in word tokenizer
- nltk.sent_tokenize(doc) - nltk built-in sentence tokenizer

Lemmatization & Stemming
- input="List listed lists listing listings" - different suffixes
- words=input.lower().split(' ') - normalize (lowercase) the words
- porter=nltk.PorterStemmer() - initialise the stemmer
- [porter.stem(t) for t in words] - create a list of stems
- WNL=nltk.WordNetLemmatizer() - initialise the WordNet lemmatizer
- [WNL.lemmatize(t) for t in words] - use the lemmatizer

Part of Speech (POS) Tagging
- nltk.help.upenn_tagset('MD') - look up the definition of a POS tag
- nltk.pos_tag(words) - nltk built-in POS tagger
- (use an alternative tagger to illustrate ambiguity)

Sentence Parsing
- g=nltk.data.load('grammar.cfg') - load a grammar from a file
- g=nltk.CFG.fromstring("""...""") - manually define a grammar
- parser=nltk.ChartParser(g) - create a parser out of the grammar
- trees=parser.parse_all(text); for tree in trees: print(tree)
- treebank.parsed_sents('wsj_0001.mrg') - Treebank parsed sentences (after from nltk.corpus import treebank)

Text Classification
- from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
- vect=CountVectorizer().fit(X_train) - fit a bag-of-words model to the data
- vect.get_feature_names() - get the features
- vect.transform(X_train) - convert to a document-term matrix

Entity Recognition (Chunking/Chinking)
- g="NP: {<DT>?<JJ>*<NN>}" - regex chunk grammar
- cp=nltk.RegexpParser(g) - parse the grammar
- ch=cp.parse(pos_sent) - parse a tagged sentence using the grammar
- print(ch) - show chunks
- ch.draw() - show chunks in an IOB tree
- cp.evaluate(test_sents) - evaluate against a test document
- sents=nltk.corpus.treebank.tagged_sents()
- print(nltk.ne_chunk(sent)) - print the chunk tree
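A minimal sketch pulling together the tokenization, stemming, and tagging entries above; the sample sentence and download calls are illustrative assumptions, not part of the original sheet.

```python
import nltk
from nltk.stem import PorterStemmer

# one-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "The listed listings were listed again."
words = nltk.word_tokenize(text.lower())            # word tokenization
stems = [PorterStemmer().stem(w) for w in words]    # stemming
tags = nltk.pos_tag(words)                          # POS tagging

print(stems)
print(tags)
```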

RegEx with Pandas & Named Groups
- df=pd.DataFrame(time_sents, columns=['text'])
- df['text'].str.split().str.len()
- df['text'].str.contains('word')
- df['text'].str.count(r'\d')
- df['text'].str.findall(r'\d')
- df['text'].str.replace(r'\w+day\b', '???')
- df['text'].str.replace(r'(\w)', lambda x: x.groups()[0][:3])
- df['text'].str.extract(r'(\d?\d):(\d\d)')
- df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
- df['text'].str.extractall(r'(?P<digits>\d)')
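A short, hedged sketch of these string/regex accessors in action; the time_sents data is made up for illustration.

```python
import pandas as pd

time_sents = ["Monday: the meeting is at 2:45 pm", "Call back Tuesday at 11:30 am"]
df = pd.DataFrame(time_sents, columns=["text"])

print(df["text"].str.count(r"\d"))                    # digits per row
print(df["text"].str.findall(r"(\d?\d):(\d\d)"))      # hour:minute pairs
print(df["text"].str.extractall(r"(?P<hour>\d?\d):(?P<minute>\d\d)"))  # named groups
```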

spaCy Cheat Sheet
by Nuozhi via cheatography.com/122797/cs/22963/

Init
- from spacy.lang.en import English
- nlp = English()

Basic
- doc = nlp("SOME TEXTS")
- span = doc[i:j]
- token = doc[i]

Pre-trained model
- nlp = spacy.load('en_core_web_sm')
- doc = nlp(MY_TEXT)

Named entities
- doc.ents, with .text and .label_ on each entity

spacy.tokens
- Doc(nlp.vocab, words=words, spaces=spaces) - words: a collection of words; spaces: a collection of booleans
- Span(doc, i, j, label="PERSON") - index: i, j

Matcher
- matcher = spacy.matcher.Matcher(nlp.vocab)
- matches = matcher(doc) - returns [(id, start, end)]

Add pattern to matcher
- pattern = [ { key: value } ]
- matcher.add("PATTERN_NAME", None, pattern)
- Two types of key: 1. regex pattern 2. label (i.e. POS, entity)

Phrase matching
- matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
- pattern = nlp("Golden Retriever")
- matcher.add("DOG", None, pattern)
- for match_id, start, end in matcher(doc): span = doc[start:end]

Similarity
- word vector: token.vector
- doc1.similarity(doc2), span1.similarity(span2), token1.similarity(token2), doc.similarity(token)
- returns a similarity score 0-1; NOT available for small models; cosine similarity by default

Pipeline
- nlp.pipe_names, nlp.pipeline

Add pipeline component
- def fn(doc): ... return doc
- nlp.add_pipe(fn, last, first, before, after)

Set custom attributes
- add metadata: doc._.ATTR = "ATTRIBUTE NAME"
- register globally: Doc.set_extension("ATTR", default=None)
- can be set on Doc, Token, or Span; access the property via ._

Extension attribute types
- attribute: Token.set_extension("ATTR", default=Bool)
- property: Span.set_extension("PROP", getter=fn)
- method: Doc.set_extension("METHOD", method=fn)

Boost up
- nlp.pipe(DATA)

Passing in context
- data = [("SOME TEXTS", {"KEY": "VAL"}), (...), ]
- Method 1: for doc, ctx in nlp.pipe(data, as_tuples=True): print(doc.ATTR, ctx[KEY])
- Method 2: Doc.set_extension("KEY", default=None); for doc, ctx in nlp.pipe(data, as_tuples=True): doc._.KEY = ctx["KEY"]

Using tokenizer only
- Method 1: doc = nlp.make_doc("SOME TEXTS")
- Method 2: with nlp.disable_pipes("tagger", "parser"): doc = nlp(text)
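A hedged sketch of the phrase-matching entries above. It assumes the small English model is installed (python -m spacy download en_core_web_sm) and uses the v2-style matcher.add() signature shown on this sheet; in spaCy v3 the call is matcher.add("DOG", [pattern]).

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")           # pre-trained small English model
matcher = PhraseMatcher(nlp.vocab)
matcher.add("DOG", None, nlp("Golden Retriever"))   # v2-style add()

doc = nlp("I adopted a Golden Retriever last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)               # matched span
```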

Matplotlib Cheat Sheet (Version 3.5.0)

Quick start
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    X = np.linspace(0, 2*np.pi, 100)
    Y = np.cos(X)
    fig, ax = plt.subplots()
    ax.plot(X, Y, color='green')
    fig.savefig("figure.pdf")
    fig.show()

Basic plots (function and main arguments)
- plot([X], Y, [fmt], ...) - X, Y, fmt, color, marker, linestyle
- scatter(X, Y, ...) - X, Y, [s]izes, [c]olors, marker, cmap
- bar[h](x, height, ...) - x, height, width, bottom, align, color
- imshow(Z, ...) - Z, cmap, interpolation, extent, origin
- contour[f]([X], [Y], Z, ...) - X, Y, Z, levels, colors, extent, origin
- pcolormesh([X], [Y], Z, ...) - X, Y, Z, vmin, vmax, cmap
- quiver([X], [Y], U, V, ...) - X, Y, U, V, C, units, angles
- pie(X, ...) - Z, explode, labels, colors, radius
- text(x, y, text, ...) - x, y, text, va, ha, size, weight, transform
- fill[_between][x](...) - X, Y1, Y2, color, where

Advanced plots (function and main arguments)
- step(X, Y, [fmt], ...) - X, Y, fmt, color, marker, where
- boxplot(X, ...) - notch, sym, bootstrap, widths
- errorbar(X, Y, xerr, yerr, ...) - X, Y, xerr, yerr, fmt
- hist(X, bins, ...) - X, bins, range, density, weights
- violinplot(D, ...) - D, positions, widths, vert
- barbs([X], [Y], U, V, ...) - X, Y, U, V, C, length, pivot, sizes
- eventplot(positions, ...) - positions, orientation, lineoffsets
- hexbin(X, Y, C, ...) - X, Y, C, gridsize, bins

Scales - ax.set_[xy]scale(scale, ...): linear (any values), log (values > 0), symlog (any values), logit (0 < values < 1)

Projections - subplot(..., projection=p) with p='polar', p='3d', or a cartopy projection such as p=Orthographic() (from cartopy.crs import Orthographic)

Tick locators - ax.[xy]axis.set_[minor|major]_locator(locator), with from matplotlib import ticker:
ticker.NullLocator(), MultipleLocator(0.5), FixedLocator([0, 1, 5]), LinearLocator(numticks=3), IndexLocator(base=0.5, offset=0.25), AutoLocator(), MaxNLocator(n=4), LogLocator(base=10, numticks=15)

Tick formatters - ax.[xy]axis.set_[minor|major]_formatter(formatter), with:
ticker.NullFormatter(), FixedFormatter(['', '0', '1', ...]), FuncFormatter(lambda x, pos: "[%.2f]" % x), FormatStrFormatter('>%d<'), ScalarFormatter(), StrMethodFormatter('{x}'), PercentFormatter(xmax=5)

Animation
    import matplotlib.animation as mpla
    T = np.linspace(0, 2*np.pi, 100)
    S = np.sin(T)
    line, = plt.plot(T, S)
    def animate(i):
        line.set_ydata(np.sin(T + i/50))
    anim = mpla.FuncAnimation(plt.gcf(), animate, interval=5)
    plt.show()

Styles - plt.style.use(style): default, classic, grayscale, ggplot, seaborn, fast, bmh, Solarize_Light2, seaborn-notebook, ...

Quick reminder - ax.grid(); ax.patch.set_alpha(0); ax.set_[xy]lim(vmin, vmax); ax.set_[xy]label(label); ax.set_[xy]ticks(list); ax.set_[xy]ticklabels(list); ax.set_title(title), fig.suptitle(title); ax.tick_params(width=10, ...); ax.set_axis_[on|off](); fig.tight_layout(); plt.gcf(), plt.gca(); mpl.rc('axes', linewidth=1, ...); fig.patch.set_alpha(0); math text such as text=r'$\frac{-e^{i\pi}}{2^n}$'

Anatomy of a figure - labeled elements: Figure, Axes, Spines, Title, Legend, Grid, Line (line plot), Markers (scatter plot), Major/Minor ticks and tick labels, X/Y axis labels.

Markers - '.' 'o' 's' 'P' 'X' '*' 'p' 'D' '<' '>' '^' 'v' '1' '2' '3' '4' '+' 'x' '|' '_' plus '$...$' math-text markers; markevery accepts e.g. 10, [0, -1], (25, 5), [0, 25, -1].

Lines - linestyle or ls: "-", ":", "--", "-.", (0, (0.01, 2)); capstyle or dash_capstyle: "butt", "round", "projecting".

Colors - cycle colors 'C0'...'C9' ('Cn'); single letters b g r c m y k w; named colors ('DarkRed', 'Firebrick', 'Crimson', 'IndianRed', 'Salmon', ...); (R,G,B[,A]) tuples; '#RRGGBB[AA]' hex strings; grey levels as 'x.y' strings.

Colormaps - plt.get_cmap(name); Uniform: viridis, magma, plasma; Sequential: Greys, YlOrBr, Wistia; Diverging: Spectral, coolwarm, RdGy; Qualitative: tab10, tab20; Cyclic: twilight.

Legend - ax.legend(...): handles, labels, loc, title, frameon; layout parameters include handletextpad, handlelength, labelspacing, borderpad, columnspacing, borderaxespad, numpoints or scatterpoints, markerfacecolor (mfc), markeredgecolor (mec).

Ornaments - fig.colorbar(...): mappable, ax, cax, orientation; ax.annotate(...): text, xy, xytext, xycoords, textcoords, arrowprops.

Subplots layout - fig, axs = plt.subplots(3, 3); subplot[s](rows, cols, ...); G = gridspec(rows, cols, ...) with ax = G[0, :]; ax.inset_axes(extent); d = make_axes_locatable(ax) with ax = d.new_horizontal('10%').

Event handling
    def on_click(event):
        print(event)
    fig.canvas.mpl_connect('button_press_event', on_click)

Keyboard shortcuts - ctrl+s save, ctrl+w close plot, r reset view, f fullscreen 0/1, f/b view forward/back, p pan view, o zoom to rect, x/y constrain pan/zoom to the X/Y axis, g minor grid 0/1, G major grid 0/1, l X axis log/linear, L Y axis log/linear.

Getting help - matplotlib.org, github.com/matplotlib/matplotlib/issues, discourse.matplotlib.org, stackoverflow.com/questions/tagged/matplotlib, gitter.im/matplotlib, twitter.com/matplotlib, Matplotlib users mailing list.

Ten simple rules (READ): 1. Know Your Audience 2. Identify Your Message 3. Adapt the Figure 4. Captions Are Not Optional 5. Do Not Trust the Defaults 6. Use Color Effectively 7. Do Not Mislead the Reader 8. Avoid "Chartjunk" 9. Message Trumps Beauty 10. Get the Right Tool
Axes adjustments - plt.subplots_adjust(left, bottom, right, top, wspace, hspace).

Colormaps (full sets) - Uniform: viridis, plasma, inferno, magma, cividis; Sequential: Greys, Purples, Blues, Greens, Oranges, Reds, YlOrBr, YlOrRd, OrRd, PuRd, RdPu, BuPu, GnBu, PuBu, YlGnBu, PuBuGn, BuGn, YlGn; Diverging: PiYG, PRGn, BrBG, PuOr, RdGy, RdBu, RdYlBu, RdYlGn, Spectral, coolwarm, bwr, seismic; Qualitative: Pastel1, Pastel2, Paired, Accent, Dark2, Set1, Set2, Set3, tab10, tab20, tab20b, tab20c; Miscellaneous: terrain, ocean, cubehelix, rainbow, twilight.

Color names - any named color can be used (black, dimgray, gray, silver, white, red, firebrick, crimson, salmon, orange, gold, khaki, olive, yellowgreen, forestgreen, seagreen, teal, turquoise, deepskyblue, steelblue, navy, royalblue, slateblue, indigo, purple, orchid, magenta, pink, ...); see the full list in the Matplotlib documentation.

Legend placement - ax.legend(loc="string", bbox_to_anchor=(x, y)); loc codes inside the axes: 2 upper left, 9 upper center, 1 upper right, 6 center left, 10 center, 7 center right, 3 lower left, 8 lower center, 4 lower right. Positions outside the axes (A-L) are obtained with bbox_to_anchor: A upper right / (-0.1, 0.9), B center right / (-0.1, 0.5), C lower right / (-0.1, 0.1), D upper left / (0.1, -0.1), E upper center / (0.5, -0.1), F upper right / (0.9, -0.1), G lower left / (1.1, 0.1), H center left / (1.1, 0.5), I upper left / (1.1, 0.9), J lower right / (0.9, 1.1), K lower center / (0.5, 1.1), L lower left / (0.1, 1.1).

Extent & origin - ax.imshow(extent=..., origin=...); origin is "upper" or "lower"; extent=[xmin, xmax, ymin, ymax], e.g. extent=[0,10,0,5] or extent=[10,0,0,5].

Image interpolation - None, none, nearest, bilinear, bicubic, spline16, spline36, hanning, hamming, hermite, kaiser, quadric, catrom, gaussian, bessel, mitchell, sinc, lanczos.

Text alignments - ax.text(..., ha=..., va=...); ha: left, center, right; va: top, center, baseline, bottom.

Text parameters - ax.text(..., family=..., size=..., weight=...) or ax.text(..., fontproperties=...); sizes: xx-large (1.73), x-large (1.44), large (1.20), medium (1.00), small (0.83), x-small (0.69), xx-small (0.58); weights: black (900), bold (700), semibold (600), normal (400), ultralight (100); families: monospace, serif, sans, cursive; styles: italic, normal; variants: small-caps, normal.

Annotation connection styles - arc3 (rad=0, rad=0.3), angle3 (angleA=0, angleB=90), angle (angleA=-90, angleB=180, rad=0 or rad=25), arc (angleA=-90, angleB=0, armA=0, armB=40, rad=0), bar (fraction=0.3, fraction=-0.3, angle=180 with fraction=-0.2).

Annotation arrow styles - '-', '<-', '->', '<->', '<|-', '-|>', '<|-|>', ']-', '-[', ']-[', '|-|', ']->', '<-[', 'simple', 'fancy'.

How do I ...
- resize a figure? → fig.set_size_inches(w, h)
- save a figure? → fig.savefig("figure.pdf")
- save a transparent figure? → fig.savefig("figure.pdf", transparent=True)
- clear a figure/an axes? → fig.clear() / ax.clear()
- close all figures? → plt.close("all")
- remove ticks? → ax.set_[xy]ticks([])
- remove tick labels? → ax.set_[xy]ticklabels([])
- rotate tick labels? → ax.set_[xy]ticks(rotation=90)
- hide top spine? → ax.spines['top'].set_visible(False)
- hide legend border? → ax.legend(frameon=False)
- show error as shaded region? → ax.fill_between(X, Y+error, Y-error)
- draw a rectangle? → ax.add_patch(plt.Rectangle((0, 0), 1, 1))
- draw a vertical line? → ax.axvline(x=0.5)
- draw outside frame? → ax.plot(..., clip_on=False)
- use transparency? → ax.plot(..., alpha=0.25)
- convert an RGB image into a gray image? → gray = 0.2989*R + 0.5870*G + 0.1140*B
- set figure background color? → fig.patch.set_facecolor("grey")
- get a reversed colormap? → plt.get_cmap("viridis_r")
- get a discrete colormap? → plt.get_cmap("viridis", 10)
- show a figure for one second? → fig.show(block=False), time.sleep(1)

Performance tips - scatter(X, Y) is slow, plot(X, Y, marker="o", ls="") is fast; plotting in a loop (for i in range(n): plot(X[i])) is slow, plot(sum([x+[None] for x in X], [])) is fast; cla(), imshow(...), canvas.draw() is slow, im.set_data(...), canvas.draw() is fast.

Beyond Matplotlib - Seaborn: statistical data visualization; Cartopy: geospatial data processing; yt: volumetric data visualization; mpld3: bringing Matplotlib to the browser; Datashader: large data processing pipeline; plotnine: a grammar of graphics for Python.

Matplotlib Cheatsheets. Copyright (c) 2021 Matplotlib Development Team. Released under a CC-BY 4.0 International License.
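A hedged sketch combining the tick-locator and tick-formatter entries above on one axis; the data is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker

fig, ax = plt.subplots()
X = np.linspace(0, 5, 100)
ax.plot(X, np.sin(X))

# Major ticks every 0.5 units, formatted to one decimal place
ax.xaxis.set_major_locator(ticker.MultipleLocator(0.5))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))
plt.show()
```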
Matplotlib for beginners

Matplotlib is a library for making 2D plots in Python. It is designed with the philosophy that you should be able to create simple plots with just a few commands:

1. Initialize
    import numpy as np
    import matplotlib.pyplot as plt

2. Prepare
    X = np.linspace(0, 4*np.pi, 1000)
    Y = np.sin(X)

3. Render
    fig, ax = plt.subplots()
    ax.plot(X, Y)
    fig.show()

4. Observe - a sine wave is displayed.

Choose - Matplotlib offers several kinds of plots (see the Gallery):
- ax.scatter(X, Y) with X = np.random.uniform(0, 1, 100), Y = np.random.uniform(0, 1, 100)
- ax.bar(X, Y) with X = np.arange(10), Y = np.random.uniform(1, 10, 10)
- ax.imshow(Z) with Z = np.random.uniform(0, 1, (8,8))
- ax.contourf(Z) with Z = np.random.uniform(0, 1, (8,8))
- ax.pie(Z) with Z = np.random.uniform(0, 1, 4)
- ax.hist(Z) with Z = np.random.normal(0, 1, 100)
- ax.errorbar(X, Y, Y/4) with X = np.arange(5), Y = np.random.uniform(0, 1, 5)
- ax.boxplot(Z) with Z = np.random.normal(0, 1, (100,3))

Tweak - you can modify pretty much anything in a plot, including limits, colors, markers, line width and styles, ticks and tick labels, titles, etc.:
- ax.plot(X, Y, color="black")
- ax.plot(X, Y, linestyle="--")
- ax.plot(X, Y, linewidth=5)
- ax.plot(X, Y, marker="o")

Organize - you can plot several data sets on the same figure, but you can also split a figure into several subplots (named Axes):
- ax.plot(X, Y1, X, Y2)
- fig, (ax1, ax2) = plt.subplots(2, 1); ax1.plot(X, Y1, color="C1"); ax2.plot(X, Y2, color="C0")
- fig, (ax1, ax2) = plt.subplots(1, 2); ax1.plot(Y1, X, color="C1"); ax2.plot(Y2, X, color="C0")

Label (everything)
- ax.plot(X, Y); fig.suptitle(None); ax.set_title("A Sine wave")
- ax.plot(X, Y); ax.set_ylabel(None); ax.set_xlabel("Time")

Explore - figures are shown with a graphical user interface that allows you to zoom and pan the figure, to navigate between the different views, and to show the value under the mouse.

Save (bitmap or vector format)
- fig.savefig("my-first-figure.png", dpi=300)
- fig.savefig("my-first-figure.pdf")

Matplotlib 3.5.0 handout for beginners. Copyright (c) 2021 Matplotlib Development Team. Released under a CC-BY 4.0 International License. Supported by NumFOCUS.
Matplotlib for intermediate users

A matplotlib figure is composed of a hierarchy of elements that forms the actual figure (title, legend, spines, axes, major/minor ticks and labels, axis labels, lines, markers). Each element can be modified.

Ticks & labels
    from matplotlib.ticker import MultipleLocator as ML
    from matplotlib.ticker import ScalarFormatter as SF
    ax.xaxis.set_minor_locator(ML(0.2))
    ax.xaxis.set_minor_formatter(SF())
    ax.tick_params(axis='x', which='minor', rotation=90)

Legend
    ax.plot(X, np.sin(X), "C0", label="Sine")
    ax.plot(X, np.cos(X), "C1", label="Cosine")
    ax.legend(bbox_to_anchor=(0, 1, 1, .1), ncol=2,
              mode="expand", loc="lower left")

Lines & markers
    X = np.linspace(0.1, 10*np.pi, 1000)
    Y = np.sin(X)
    ax.plot(X, Y, "C1o:", markevery=25, mec="1.0")

Annotation
    ax.annotate("A", (X[250], Y[250]), (X[250], -1),
                ha="center", va="center",
                arrowprops={"arrowstyle": "->", "color": "C1"})

Scales & projections
    fig, ax = plt.subplots()
    ax.set_xscale("log")
    ax.plot(X, Y, "C1o-", markevery=25, mec="1.0")

Colors - any color can be used, but Matplotlib offers ready-made sets: the cycle colors C0-C9 and grey levels given as strings "0.0"-"1.0".

Figure, axes & spines
    fig, axs = plt.subplots(3, 3)
    axs[0, 0].set_facecolor("#ddddff")
    axs[2, 2].set_facecolor("#ffffdd")

    gs = fig.add_gridspec(3, 3)
    ax = fig.add_subplot(gs[0, :])
    ax.set_facecolor("#ddddff")

    fig, ax = plt.subplots()
    ax.spines["top"].set_color("None")
    ax.spines["right"].set_color("None")

Text & ornaments
    ax.fill_betweenx([-1, 1], [0], [2*np.pi])
    ax.text(0, -1, r" Period $\Phi$")

Size & DPI - consider a square figure to be included in a two-column A4 paper with 2 cm margins on each side and a column separation of 1 cm. The width of a figure is (21 - 2*2 - 1)/2 = 8 cm. One inch being 2.54 cm, the figure size should be 3.15 x 3.15 in.
    fig = plt.figure(figsize=(3.15, 3.15), dpi=50)
    plt.savefig("figure.pdf", dpi=600)

Matplotlib 3.5.0 handout for intermediate users. Copyright (c) 2021 Matplotlib Development Team. Released under a CC-BY 4.0 International License. Supported by NumFOCUS.
Matplotlib tips & tricks

Transparency - scatter plots can be enhanced by using transparency (alpha) in order to show areas with higher density; multiple scatter plots can be used to delineate a frontier.
    X = np.random.normal(-1, 1, 500)
    Y = np.random.normal(-1, 1, 500)
    ax.scatter(X, Y, 50, "0.0", lw=2)   # optional
    ax.scatter(X, Y, 50, "1.0", lw=0)   # optional
    ax.scatter(X, Y, 40, "C1", lw=0, alpha=0.1)

Rasterization - if your figure has many graphical elements, such as a huge scatter, you can rasterize them to save memory and keep other elements in vector format.
    X = np.random.normal(-1, 1, 10_000)
    Y = np.random.normal(-1, 1, 10_000)
    ax.scatter(X, Y, rasterized=True)
    fig.savefig("rasterized-figure.pdf", dpi=600)

Offline rendering - use the Agg backend to render a figure directly into an array.
    from matplotlib.figure import Figure
    from matplotlib.backends.backend_agg import FigureCanvas
    canvas = FigureCanvas(Figure())
    ...  # draw some stuff
    canvas.draw()
    Z = np.array(canvas.renderer.buffer_rgba())

Text outline - use a text outline to make text more visible.
    import matplotlib.patheffects as fx
    text = ax.text(0.5, 0.1, "Label")
    text.set_path_effects([fx.Stroke(linewidth=3, foreground='1.0'),
                           fx.Normal()])

Multiline plot - you can plot several lines at once using None as a separator.
    X, Y = [], []
    for x in np.linspace(0, 10*np.pi, 100):
        X.extend([x, x, None]), Y.extend([0, np.sin(x), None])
    ax.plot(X, Y, "black")

Dotted lines - to have rounded dotted lines, use a custom linestyle and modify dash_capstyle.
    ax.plot([0, 1], [0, 0], "C1", linestyle=(0, (0.01, 1)), dash_capstyle="round")
    ax.plot([0, 1], [1, 1], "C1", linestyle=(0, (0.01, 2)), dash_capstyle="round")

Colorbar adjustment - you can adjust a colorbar's size when adding it.
    im = ax.imshow(Z)
    cb = plt.colorbar(im, fraction=0.046, pad=0.04)
    cb.set_ticks([])

Taking advantage of typography - you can use a condensed font such as Roboto Condensed to save space on tick labels.
    for tick in ax.get_xticklabels(which='both'):
        tick.set_fontname("Roboto Condensed")

Getting rid of margins - once your figure is finished, you can call tight_layout() to remove white margins. If there are remaining margins, you can use the pdfcrop utility (it comes with TeX Live).

Hatching - you can achieve a nice visual effect with thick hatch patterns.
    cmap = plt.get_cmap("Oranges")
    plt.rcParams['hatch.color'] = cmap(0.2)
    plt.rcParams['hatch.linewidth'] = 8
    ax.bar(X, Y, color=cmap(0.6), hatch="/")

Range of continuous colors - you can use a colormap to pick from a range of continuous colors.
    X = np.random.randn(1000, 4)
    cmap = plt.get_cmap("Oranges")
    colors = cmap([0.2, 0.4, 0.6, 0.8])
    ax.hist(X, 2, histtype='bar', color=colors)

Combining axes - you can use overlaid axes with different projections.
    ax1 = fig.add_axes([0, 0, 1, 1], label="cartesian")
    ax2 = fig.add_axes([0, 0, 1, 1], label="polar", projection="polar")

Read the documentation - Matplotlib comes with extensive documentation explaining the details of each command, generally accompanied by examples. Together with the huge online gallery, this documentation is a gold mine.

Matplotlib 3.5.0 handout for tips & tricks. Copyright (c) 2021 Matplotlib Development Team. Released under a CC-BY 4.0 International License. Supported by NumFOCUS.
6.2. R language
Base R Cheat Sheet

Getting Help
- ?mean - get help for a particular function.
- help.search('weighted mean') - search the help files for a word or phrase.
- help(package = 'dplyr') - find help for a package.

More about an object
- str(iris) - get a summary of an object's structure.
- class(iris) - find the class an object belongs to.

Using Packages
- install.packages('dplyr') - download and install a package from CRAN.
- library(dplyr) - load the package into the session, making all its functions available to use.
- dplyr::select - use a particular function from a package.
- data(iris) - load a built-in dataset into the environment.

Working Directory
- getwd() - find the current working directory (where inputs are found and outputs are sent).
- setwd('C://file/path') - change the current working directory.
- Use projects in RStudio to set the working directory to the folder you are working in.

Vectors

Creating vectors
- c(2, 4, 6) → 2 4 6 - join elements into a vector
- 2:6 → 2 3 4 5 6 - an integer sequence
- seq(2, 3, by=0.5) → 2.0 2.5 3.0 - a complex sequence
- rep(1:2, times=3) → 1 2 1 2 1 2 - repeat a vector
- rep(1:2, each=3) → 1 1 1 2 2 2 - repeat elements of a vector

Vector functions
- sort(x) - return x sorted; rev(x) - return x reversed
- table(x) - see counts of values; unique(x) - see unique values

Selecting vector elements
- By position: x[4] the fourth element; x[-4] all but the fourth; x[2:4] elements two to four; x[-(2:4)] all elements except two to four; x[c(1, 5)] elements one and five.
- By value: x[x == 10] elements equal to 10; x[x < 0] all elements less than zero; x[x %in% c(1, 2, 5)] elements in the set 1, 2, 5.
- Named vectors: x['apple'] the element with name 'apple'.

Programming

For loop
    for (variable in sequence){
      Do something
    }
Example:
    for (i in 1:4){
      j <- i + 10
      print(j)
    }

While loop
    while (condition){
      Do something
    }
Example:
    while (i < 5){
      print(i)
      i <- i + 1
    }

If statements
    if (condition){
      Do something
    } else {
      Do something different
    }
Example:
    if (i > 3){
      print('Yes')
    } else {
      print('No')
    }

Functions
    function_name <- function(var){
      Do something
      return(new_variable)
    }
Example:
    square <- function(x){
      squared <- x*x
      return(squared)
    }

Reading and Writing Data (also see the readr package)
- df <- read.table('file.txt') / write.table(df, 'file.txt') - read and write a delimited text file.
- df <- read.csv('file.csv') / write.csv(df, 'file.csv') - read and write a comma-separated value file; a special case of read.table/write.table.
- load('file.RData') / save(df, file = 'file.RData') - read and write an R data file, a file type special to R.

Conditions
- a == b are equal; a != b not equal; a > b greater than; a < b less than; a >= b greater than or equal to; a <= b less than or equal to; is.na(a) is missing; is.null(a) is null.
Types - converting between common data types in R; you can always go from a higher value in the table to a lower value.
- as.logical → TRUE, FALSE, TRUE - Boolean values (TRUE or FALSE).
- as.numeric → 1, 0, 1 - integers or floating point numbers.
- as.character → '1', '0', '1' - character strings; generally preferred to factors.
- as.factor → '1', '0', '1' with levels '1', '0' - character strings with preset levels; needed for some statistical models.

Maths Functions
- log(x) natural log; exp(x) exponential; max(x) largest element; min(x) smallest element; sum(x) sum; mean(x) mean; median(x) median; quantile(x) percentage quantiles; rank(x) rank of elements; round(x, n) round to n decimal places; signif(x, n) round to n significant figures; var(x) the variance; sd(x) the standard deviation; cor(x, y) correlation. (Also see the dplyr package.)

Variable Assignment
    > a <- 'apple'
    > a
    [1] 'apple'

The Environment
- ls() - list all variables in the environment.
- rm(x) - remove x from the environment.
- rm(list = ls()) - remove all variables from the environment.
- You can use the environment panel in RStudio to browse the variables in your environment.

Matrices
- m <- matrix(x, nrow = 3, ncol = 3) - create a matrix from x.
- m[2, ] select a row; m[ , 1] select a column; m[2, 3] select an element.
- t(m) transpose; m %*% n matrix multiplication; solve(m, n) find x in m * x = n.

Lists
- l <- list(x = 1:5, y = c('a', 'b')) - a list is a collection of elements which can be of different types.
- l[[2]] second element of l; l[1] new list with only the first element; l$x element named x; l['y'] new list with only the element named y.

Data Frames
- df <- data.frame(x = 1:3, y = c('a', 'b', 'c')) - a special case of a list where all elements are the same length.
- List subsetting: df$x, df[[2]].
- Matrix subsetting: df[ , 2], df[2, ], df[2, 2].
- Understanding a data frame: View(df) see the full data frame; head(df) see the first 6 rows; nrow(df) number of rows; ncol(df) number of columns; dim(df) number of columns and rows.
- cbind - bind columns; rbind - bind rows.

Strings (also see the stringr package)
- paste(x, y, sep = ' ') - join multiple vectors together.
- paste(x, collapse = ' ') - join elements of a vector together.
- grep(pattern, x) - find regular expression matches in x.
- gsub(pattern, replace, x) - replace matches in x with a string.
- toupper(x) convert to uppercase; tolower(x) convert to lowercase.
- nchar(x) - number of characters in a string.

Factors
- factor(x) - turn a vector into a factor; you can set the levels of the factor and their order.
- cut(x, breaks = 4) - turn a numeric vector into a factor by 'cutting' it into sections.

Statistics
- lm(y ~ x, data=df) - linear model; glm(y ~ x, data=df) - generalised linear model.
- t.test(x, y) - perform a t-test for a difference between means; pairwise.t.test - perform a t-test for paired data.
- prop.test - test for a difference between proportions; aov - analysis of variance.
- summary - get more detailed information out of a model.

Distributions (random variates, density, cumulative distribution, quantile)
- Normal: rnorm, dnorm, pnorm, qnorm
- Poisson: rpois, dpois, ppois, qpois
- Binomial: rbinom, dbinom, pbinom, qbinom
- Uniform: runif, dunif, punif, qunif

Plotting (also see the ggplot2 package)
- plot(x) values of x in order; plot(x, y) values of x against y; hist(x) histogram of x.

Dates - see the lubridate package.
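A brief sketch using the data-frame, statistics, and plotting entries above on the built-in iris data; model choice and variables are illustrative only.

```r
data(iris)                 # load a built-in dataset
head(iris)                 # first 6 rows
dim(iris)                  # rows and columns

# Simple linear model and its summary
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(fit)

# Base plotting
plot(iris$Sepal.Length, iris$Petal.Length)
hist(iris$Sepal.Width)
```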

Data Import with readr, tibble, and tidyr Cheat Sheet

R's tidyverse is built around tidy data stored in tibbles, an enhanced version of a data frame. The front side of this sheet shows how to read text files into R with readr; the reverse side shows how to create tibbles with tibble and to lay out tidy data with tidyr.

Read tabular data to tibbles - these functions share the common arguments:
    read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"),
           quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
           guess_max = min(1000, n_max), progress = interactive())
- read_csv("file.csv") - reads comma-delimited files.
- read_csv2("file2.csv") - reads semicolon-delimited files.
- read_delim("file.txt", delim = "|") - reads files with any delimiter (delim, quote = "\"", escape_backslash = FALSE, escape_double = TRUE).
- read_fwf("file.fwf", col_positions = c(1, 3, 5)) - reads fixed-width files.
- read_tsv("file.tsv") - reads tab-delimited files; also read_table().

Useful arguments (example file: x = read_csv("a,b,c\n1,2,3\n4,5,NA"))
- Skip lines: read_csv("file.csv", skip = 1)
- Read in a subset: read_csv("file.csv", n_max = 1)
- No header: read_csv("file.csv", col_names = FALSE)
- Provide a header: read_csv("file.csv", col_names = c("x", "y", "z"))
- Missing values: read_csv("file.csv", na = c("4", "5", "."))

Read non-tabular data
- read_file(file, locale = default_locale()) - read a file into a single string.
- read_file_raw(file) - read a file into a raw vector.
- read_lines(file, skip = 0, n_max = -1L, locale = default_locale(), na = character(), progress = interactive()) - read each line into its own string.
- read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive()) - read each line into a raw vector.
- read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive()) - Apache-style log files.

Write functions - save x, an R object, to path, a file path, with:
- write_csv(x, path, na = "NA", append = FALSE, col_names = !append) - tibble/df to a comma-delimited file, e.g. write_csv(path = "file.csv", x = read_csv("a,b,c\n1,2,3\n4,5,NA")).
- write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append) - tibble/df to a file with any delimiter.
- write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append) - tibble/df to a CSV for Excel.
- write_file(x, path, append = FALSE) - string to file.
- write_lines(x, path, na = "NA", append = FALSE) - string vector to file, one element per line.
- write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...) - object to an RDS file.
- write_tsv(x, path, na = "NA", append = FALSE, col_names = !append) - tibble/df to a tab-delimited file.

Parsing data types
readr functions guess the type of each column and convert types when appropriate (but will NOT convert strings to factors automatically). A message shows the type of each column in the result, e.g.:
    ## Parsed with column specification:
    ## cols(
    ##   age = col_integer(),     # age is an integer
    ##   sex = col_character(),   # sex is a character
    ##   earn = col_double()      # earn is a double (numeric)
    ## )
1. Use problems() to diagnose problems: x <- read_csv("file.csv"); problems(x)
2. Use a col_ function to guide parsing:
   col_guess() (the default), col_character(), col_double(), col_euro_double(), col_datetime(format = "") (also col_date(format = "") and col_time(format = "")), col_factor(levels, ordered = FALSE), col_integer(), col_logical(), col_number(), col_numeric(), col_skip().
   Example: x <- read_csv("file.csv", col_types = cols(A = col_double(), B = col_logical(), C = col_factor()))
3. Else, read columns in as character vectors and then parse with a parse_ function:
   parse_guess(), parse_character(), parse_datetime(format = "") (also parse_date() and parse_time()), parse_double(), parse_factor(levels, ordered = FALSE), parse_integer(), parse_logical(), parse_number(); each takes na = c("", "NA") and locale = default_locale().
   Example: x$A <- parse_number(x$A)

Other types of data - try one of the following packages to import other types of files:
- haven - SPSS, Stata, and SAS files
- readxl - Excel files (.xls and .xlsx)
- DBI - databases
- jsonlite - JSON
- xml2 - XML
- httr - Web APIs
- rvest - HTML (web scraping)
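A hedged sketch of reading a CSV with an explicit column specification; the file name and column names are made up for illustration.

```r
library(readr)

x <- read_csv("file.csv",
              col_types = cols(
                age  = col_integer(),
                sex  = col_character(),
                earn = col_double()
              ),
              na = c("", "NA"))

problems(x)                        # diagnose any parsing failures
parse_number(c("$1,200", "87%"))   # strip non-numeric characters: 1200, 87
```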

Tibbles - an enhanced data frame

The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
- Display - when you print a tibble, R provides a concise view of the data that fits on one screen (e.g. "# A tibble: 234 x 6 ... with 224 more rows, and 3 more variables").
- Subsetting - [ always returns a new tibble, while [[ and $ always return a vector.
- No partial matching - you must use full column names when subsetting.

Tibble display
- Control the default appearance with options: options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)
- View the entire data set with View(x, title) or glimpse(x, width = NULL, ...)
- Revert to a data frame with as.data.frame() (required for some older packages)

Construct a tibble in two ways
- tibble(x = 1:3, y = c("a", "b", "c")) - construct by columns.
- tribble(~x, ~y, 1, "a", 2, "b", 3, "c") - construct by rows.
Both make the same tibble (a 3 x 2 tibble with x <int> and y <chr>).
- as_tibble(x, ...) - convert a data frame to a tibble.
- enframe(x, name = "name", value = "value") - converts a named vector to a tibble with a names column and a values column.
- is_tibble(x) - test whether x is a tibble.

Tidy Data with tidyr

Tidy data is a way to organize tabular data. It provides a consistent data structure across packages. A table is tidy if each variable is in its own column and each observation, or case, is in its own row. Tidy data makes variables easy to access as vectors and preserves cases during vectorized operations.

Reshape data - change the layout of values in a table
Use gather() and spread() to reorganize the values of a table into a new layout. Each uses the idea of a key column : value column pair.
- gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE) - moves column names into a key column, gathering the column values into a single value column. Example: gather(table4a, `1999`, `2000`, key = "year", value = "cases")
- spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL) - moves the unique values of a key column into the column names, spreading the values of a value column across the new columns that result. Example: spread(table2, type, count)

Split and combine cells - use these functions to split or combine cells into individual, isolated values.
- separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...) - separate each cell in a column to make several columns. Example: separate(table3, rate, into = c("cases", "pop"))
- separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE) - separate each cell in a column to make several rows; also separate_rows_(). Example: separate_rows(table3, rate)
- unite(data, col, ..., sep = "_", remove = TRUE) - collapse cells across several columns to make a single column. Example: unite(table5, century, year, col = "year", sep = "")

Handle missing values
- drop_na(data, ...) - drop rows containing NA's in the ... columns. Example: drop_na(x, x2)
- fill(data, ..., .direction = c("down", "up")) - fill in NA's in the ... columns with the most recent non-NA values. Example: fill(x, x2)
- replace_na(data, replace = list(), ...) - replace NA's by column. Example: replace_na(x, list(x2 = 2))

Expand tables - quickly create tables with combinations of values
- complete(data, ..., fill = list()) - adds missing combinations of the values of the variables listed in ... to the data. Example: complete(mtcars, cyl, gear, carb)
- expand(data, ...) - create a new tibble with all possible combinations of the values of the variables listed in ... Example: expand(mtcars, cyl, gear, carb)
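A compact sketch of the reshape and split verbs above on a small made-up table (the readr-era gather()/spread() interface shown on this sheet is assumed).

```r
library(tidyr)

cases <- data.frame(country = c("A", "B"),
                    `1999` = c(0.7, 37), `2000` = c(2, 80),
                    check.names = FALSE)

long <- gather(cases, `1999`, `2000`, key = "year", value = "cases")  # wide to long
wide <- spread(long, year, cases)                                     # long to wide

rates <- data.frame(country = c("A", "B"), rate = c("0.7/19", "37/172"))
separate(rates, rate, into = c("cases", "pop"), sep = "/")            # split a column
```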
Data Wrangling with dplyr and tidyr Cheat Sheet

Tidy Data - a foundation for wrangling in R
In a tidy data set each variable is saved in its own column and each observation is saved in its own row. Tidy data complements R's vectorized operations: R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

Syntax - helpful conventions for wrangling
- dplyr::tbl_df(iris) - converts data to the tbl class; tbl's are easier to examine than data frames because R displays only the data that fits onscreen.
- dplyr::glimpse(iris) - information-dense summary of tbl data.
- utils::View(iris) - view the data set in a spreadsheet-like display (note the capital V).
- dplyr::data_frame(a = 1:3, b = 4:6) - combine vectors into a data frame (optimized).
- dplyr::arrange(mtcars, mpg) - order rows by values of a column (low to high).
- dplyr::arrange(mtcars, desc(mpg)) - order rows by values of a column (high to low).
- dplyr::rename(tb, y = year) - rename the columns of a data frame.
- dplyr::%>% - passes the object on the left-hand side as the first argument (or . argument) of the function on the right-hand side: x %>% f(y) is the same as f(x, y); y %>% f(x, ., z) is the same as f(x, y, z). "Piping" with %>% makes code more readable, e.g.
    iris %>%
      group_by(Species) %>%
      summarise(avg = mean(Sepal.Width)) %>%
      arrange(avg)

Reshaping data - change the layout of a data set
- tidyr::gather(cases, "year", "n", 2:4) - gather columns into rows.
- tidyr::spread(pollution, size, amount) - spread rows into columns.
- tidyr::separate(storms, date, c("y", "m", "d")) - separate one column into several.
- tidyr::unite(data, col, ..., sep) - unite several columns into one.

Subset observations (rows)
- dplyr::filter(iris, Sepal.Length > 7) - extract rows that meet logical criteria.
- dplyr::distinct(iris) - remove duplicate rows.
- dplyr::sample_frac(iris, 0.5, replace = TRUE) - randomly select a fraction of rows.
- dplyr::sample_n(iris, 10, replace = TRUE) - randomly select n rows.
- dplyr::slice(iris, 10:15) - select rows by position.
- dplyr::top_n(storms, 2, date) - select and order the top n entries (by group if the data is grouped).

Logic in R - ?Comparison, ?base::Logic
<, >, ==, <=, >=, != comparisons; %in% group membership; is.na / !is.na; &, |, !, xor, any, all Boolean operators.

Subset variables (columns)
- dplyr::select(iris, Sepal.Width, Petal.Length, Species) - select columns by name or helper function.
Helper functions for select (?select):
- select(iris, contains(".")) - columns whose name contains a character string.
- select(iris, ends_with("Length")) - columns whose name ends with a character string.
- select(iris, everything()) - every column.
- select(iris, matches(".t.")) - columns whose name matches a regular expression.
- select(iris, num_range("x", 1:5)) - columns named x1, x2, x3, x4, x5.
- select(iris, one_of(c("Species", "Genus"))) - columns whose names are in a group of names.
- select(iris, starts_with("Sepal")) - columns whose name starts with a character string.
- select(iris, Sepal.Length:Petal.Width) - all columns between Sepal.Length and Petal.Width (inclusive).
- select(iris, -Species) - all columns except Species.
Summarise Data
- dplyr::summarise(iris, avg = mean(Sepal.Length)) - summarise data into a single row of values.
- dplyr::summarise_each(iris, funs(mean)) - apply a summary function to each column.
- dplyr::count(iris, Species, wt = Sepal.Length) - count the number of rows with each unique value of a variable (with or without weights).

Summarise uses summary functions, functions that take a vector of values and return a single value, such as:
- dplyr::first, dplyr::last, dplyr::nth - first/last/nth value of a vector.
- dplyr::n - number of values in a vector; dplyr::n_distinct - number of distinct values in a vector.
- min, max, mean, median, quantile, var, sd, IQR.

Make New Variables
- dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width) - compute and append one or more new columns.
- dplyr::mutate_each(iris, funs(min_rank)) - apply a window function to each column.
- dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width) - compute one or more new columns; drop the original columns.

Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:
- dplyr::lead - copy with values shifted by 1; dplyr::lag - copy with values lagged by 1.
- dplyr::dense_rank - ranks with no gaps; dplyr::min_rank - ranks, ties get the minimum rank; dplyr::percent_rank - ranks rescaled to [0, 1]; dplyr::row_number - ranks, ties go to the first value; dplyr::ntile - bin a vector into n buckets.
- dplyr::between - are values between a and b?; dplyr::cume_dist - cumulative distribution.
- dplyr::cumall, dplyr::cumany, dplyr::cummean, cumsum, cummax, cummin, cumprod - cumulative all/any/mean/sum/max/min/prod.
- pmax, pmin - element-wise max and min.

Group Data
- dplyr::group_by(iris, Species) - group data into rows with the same value of Species.
- dplyr::ungroup(iris) - remove grouping information from a data frame.
- iris %>% group_by(Species) %>% summarise(...) - compute a separate summary row for each group.
- iris %>% group_by(Species) %>% mutate(...) - compute new variables by group.

Combine Data Sets (a and b share the key column x1; y and z have the same columns)

Mutating joins
- dplyr::left_join(a, b, by = "x1") - join matching rows from b to a.
- dplyr::right_join(a, b, by = "x1") - join matching rows from a to b.
- dplyr::inner_join(a, b, by = "x1") - join data; retain only rows present in both sets.
- dplyr::full_join(a, b, by = "x1") - join data; retain all values, all rows.

Filtering joins
- dplyr::semi_join(a, b, by = "x1") - all rows in a that have a match in b.
- dplyr::anti_join(a, b, by = "x1") - all rows in a that do not have a match in b.

Set operations
- dplyr::intersect(y, z) - rows that appear in both y and z.
- dplyr::union(y, z) - rows that appear in either or both y and z.
- dplyr::setdiff(y, z) - rows that appear in y but not z.

Binding
- dplyr::bind_rows(y, z) - append z to y as new rows.
- dplyr::bind_cols(y, z) - append z to y as new columns. Caution: matches rows by position.

(The example data sets used on this sheet are available via devtools::install_github("rstudio/EDAWR").)
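A small sketch of the grouped summary and join verbs above; the two toy tables are made up for illustration.

```r
library(dplyr)

a <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3)
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

left_join(a, b, by = "x1")    # keep all rows of a
inner_join(a, b, by = "x1")   # keep only matching rows
anti_join(a, b, by = "x1")    # rows of a with no match in b

iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width))
```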
R For Data Science Cheat Sheet: data.table

data.table is an R package that provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.

Load the package:
    > library(data.table)

General form: DT[i, j, by] - "Take DT, subset rows using i, then calculate j grouped by by."

Creating a data.table
    > set.seed(45L)
    > DT <- data.table(V1=c(1L,2L), V2=LETTERS[1:3], V3=round(rnorm(4),4), V4=1:12)

Subsetting rows using i
- DT[3:5,] or DT[3:5] - select the 3rd to 5th row
- DT[V2=="A"] - select all rows that have value A in column V2
- DT[V2 %in% c("A","C")] - select all rows that have value A or C in column V2

Manipulating on columns in j
- DT[,V2] - return V2 as a vector
- DT[,.(V2,V3)] or DT[,list(V2,V3)] - return V2 and V3 as a data.table
- DT[,sum(V1)] - return the sum of all elements of V1 as a vector
- DT[,.(sum(V1),sd(V3))] - return the sum of all elements of V1 and the std. dev. of V3 in a data.table
- DT[,.(Aggregate=sum(V1), Sd.V3=sd(V3))] - the same as above, with new names
- DT[,.(V1,Sd.V3=sd(V3))] - select column V1 and compute the std. dev. of V3, which returns a single value and gets recycled
- DT[,.(print(V2), plot(V3), NULL)] - print column V2 and plot V3

Doing j by group
- DT[,.(V4.Sum=sum(V4)),by=V1] - calculate the sum of V4 for every group in V1
- DT[,.(V4.Sum=sum(V4)),by=.(V1,V2)] - calculate the sum of V4 for every group in V1 and V2
- DT[,.(V4.Sum=sum(V4)),by=sign(V1-1)] - calculate the sum of V4 for every group in sign(V1-1)
- DT[,.(V4.Sum=sum(V4)),by=.(V1.01=sign(V1-1))] - the same as above, with a new name for the grouping variable
- DT[1:5,.(V4.Sum=sum(V4)),by=V1] - calculate the sum of V4 for every group in V1 after subsetting on the first 5 rows
- DT[,.N,by=V1] - count the number of rows for every group in V1
- DT[,mean(V3),by=.(V1,V2)] - return the result of j, grouped by all possible combinations of the groups specified in by

Adding/updating columns by reference in j using :=
- DT[,V1:=round(exp(V1),2)] - V1 is updated by what is after :=; calling DT returns the result
- DT[,c("V1","V2"):=list(round(exp(V1),2), LETTERS[4:6])] - columns V1 and V2 are updated by what is after :=
- DT[,':='(V1=round(exp(V1),2), V2=LETTERS[4:6])][] - alternative to the above; with the trailing [] the result is also printed to the screen
- DT[,V1:=NULL] - remove V1
- DT[,c("V1","V2"):=NULL] - remove columns V1 and V2
- Cols.chosen=c("A","B"); DT[,Cols.chosen:=NULL] - deletes the column with the column name Cols.chosen
- DT[,(Cols.chosen):=NULL] - deletes the columns specified in the variable Cols.chosen

Indexing and keys
- setkey(DT,V2) - a key is set on V2; output is returned invisibly
- DT["A"] - return all rows where the key column (set to V2) has the value A
- DT[c("A","C")] - return all rows where the key column (V2) has value A or C
- DT["A",mult="first"] - return the first row of all rows that match value A in key column V2
- DT["A",mult="last"] - return the last row of all rows that match value A in key column V2
- DT[c("A","D")] - return all rows where key column V2 has value A or D (non-matching keys give NA rows)
- DT[c("A","D"),nomatch=0] - return all rows where key column V2 has value A or D, dropping non-matches
- DT[c("A","C"),sum(V4)] - return the total sum of V4 for rows of key column V2 that have values A or C
- DT[c("A","C"),sum(V4),by=.EACHI] - return the sum of column V4 for rows of V2 that have value A, and another sum for rows of V2 that have value C
- setkey(DT,V1,V2) - sort by V1 and then by V2 within each group of V1 (invisible)
- DT[.(2,"C")] - select rows that have value 2 for the first key (V1) and the value C for the second key (V2)
- DT[.(2,c("A","C"))] - select rows that have value 2 for the first key (V1) and, within those rows, the value A or C for the second key (V2)

Advanced data.table operations
- DT[.N-1] - return the penultimate row of the DT
- DT[,.N] - return the number of rows

.SD & .SDcols
- DT[,print(.SD),by=V2] - look at what .SD contains
- DT[,.SD[c(1,.N)],by=V2] - select the first and last row grouped by V2
- DT[,lapply(.SD,sum),by=V2] - calculate the sum of the columns in .SD grouped by V2
- DT[,lapply(.SD,sum),by=V2,.SDcols=c("V3","V4")] - calculate the sum of V3 and V4 in .SD grouped by V2
- DT[,lapply(.SD,sum),by=V2,.SDcols=paste0("V",3:4)] - the same, selecting .SDcols with paste0

Chaining
- DT <- DT[,.(V4.Sum=sum(V4)),by=V1] - calculate the sum of V4, grouped by V1
- DT[V4.Sum>40] - select the groups for which the sum is >40
- DT[,.(V4.Sum=sum(V4)),by=V1][V4.Sum>40] - select the groups for which the sum is >40 (chaining)
- DT[,.(V4.Sum=sum(V4)),by=V1][order(-V1)] - calculate the sum of V4, grouped by V1, ordered on V1 descending

set()-family
- set() - syntax: for (i in from:to) set(DT, row, column, new value)
    > rows <- list(3:4,5:6)
    > cols <- 1:2
    > for(i in seq_along(rows)){ set(DT, i=rows[[i]], j=cols[i], value=NA) }
  Sequence along the values of rows and, for the values of cols, set the values of those elements equal to NA (invisible).
- setnames() - syntax: setnames(DT,"old","new")[]
    > setnames(DT,"V2","Rating") - set the name of V2 to Rating (invisible)
    > setnames(DT, c("V2","V3"), c("V2.rating","V3.DC")) - change two column names (invisible)
- setcolorder() - syntax: setcolorder(DT,"neworder")
    > setcolorder(DT, c("V2","V1","V4","V3")) - change the column ordering to the contents of the specified vector (invisible)
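A brief sketch of DT[i, j, by] in action on a toy table (the column values are recycled, as in the example above).

```r
library(data.table)

DT <- data.table(V1 = c(1L, 2L), V2 = LETTERS[1:3], V4 = 1:12)

DT[V2 == "A"]                        # i: subset rows
DT[, .(V4.Sum = sum(V4)), by = V1]   # j + by: grouped summary
DT[, V5 := V4 * 2]                   # add a column by reference
DT[, .(V4.Sum = sum(V4)), by = V1][V4.Sum > 40]  # chaining
```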
Data Transformation with data.table :: CHEAT SHEET

Basics
data.table is an extremely fast and memory-efficient package for transforming data in R. It works by converting R's native data frame objects into data.tables with new and enhanced functionality. The basics of working with data.tables are:

dt[i, j, by] - take data.table dt, subset rows using i and manipulate columns with j, grouped according to by.

data.tables are also data frames - functions that work with data frames therefore also work with data.tables.

Create a data.table
- data.table(a = c(1, 2), b = c("a", "b")) - create a data.table from scratch. Analogous to data.frame().
- setDT(df)* or as.data.table(df) - convert a data frame or a list to a data.table.

Subset rows using i
- dt[1:2, ] - subset rows based on row numbers.
- dt[a > 5, ] - subset rows based on values in one or more columns.
- Logical operators to use in i: <, <=, >, >=, is.na(), !is.na(), %in%, !, |, &, %like%, %between%

Manipulate columns with j
EXTRACT
- dt[, c(2)] - extract columns by number; prefix column numbers with "-" to drop.
- dt[, .(b, c)] - extract columns by name.
SUMMARIZE
- dt[, .(x = sum(a))] - create a data.table with new columns based on the summarized values of rows. Summary functions like mean(), median(), min(), max(), etc. can be used to summarize rows.
COMPUTE COLUMNS*
- dt[, c := 1 + 2] - compute a column based on an expression.
- dt[a == 1, c := 1 + 2] - compute a column based on an expression, but only for a subset of rows.
- dt[, `:=`(c = 1, d = 2)] - compute multiple columns based on separate expressions.
DELETE COLUMN
- dt[, c := NULL] - delete a column.
CONVERT COLUMN TYPE
- dt[, b := as.integer(b)] - convert the type of a column using as.integer(), as.numeric(), as.character(), as.Date(), etc.

Group according to by
- dt[, j, by = .(a)] - group rows by values in the specified columns.
- dt[, j, keyby = .(a)] - group and simultaneously sort rows by values in the specified columns.
COMMON GROUPED OPERATIONS
- dt[, .(c = sum(b)), by = a] - summarize rows within groups.
- dt[, c := sum(b), by = a] - create a new column and compute rows within groups.
- dt[, .SD[1], by = a] - extract the first row of each group.
- dt[, .SD[.N], by = a] - extract the last row of each group.

Chaining
- dt[...][...] - perform a sequence of data.table operations by chaining multiple "[]".

Functions for data.tables
REORDER
- setorder(dt, a, -b) - reorder a data.table according to the specified columns; prefix column names with "-" for descending order.
* SET FUNCTIONS AND :=
data.table's functions prefixed with "set" and the operator ":=" work without "<-" to alter data without making copies in memory. E.g., the more efficient "setDT(df)" is analogous to "df <- as.data.table(df)".
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique
BIND
Apply function to cols.
a b a b a b a b a b rbind(dt_a, dt_b) – combine rows of two
1 2 1 2 rows based on columns specified in “by”. + = data.tables.
2 2 2 2 Leave out “by” to use all columns. APPLY A FUNCTION TO MULTIPLE COLUMNS
1 2
a b a b dt[, lapply(.SD, mean), .SDcols = c("a", "b")] –
uniqueN(dt, by = c("a", "b")) – count the number of unique rows
1 4 2 5 apply a function – e.g. mean(), as.character(),
based on columns specified in “by”. a b x y a b x y cbind(dt_a, dt_b) – combine columns
2 5 which.max() – to columns specified in .SDcols
of two data.tables.
3 6 with lapply() and the .SD symbol. Also works
+ = with groups.
RENAME COLUMNS
a a a_m cols <- c("a")
a b x y setnames(dt, c("a", "b"), c("x", "y")) – rename 1 1 2 dt[, paste0(cols, "_m") := lapply(.SD, mean),
columns. .SDcols = cols] – apply a function to specified
Reshape a data.table
2 2 2
3 3 2 columns and assign the result with suffixed
variable names to the original data.
SET KEYS RESHAPE TO WIDE FORMAT
setkey(dt, a, b) – set keys to enable fast repeated lookup in
specified columns using “dt[.(value), ]” or for merging without id y a b id a_x a_z b_x b_z dcast(dt, Sequential rows
specifying merging columns using “dt_a[dt_b]”. A x 1 3 A 1 2 3 4 id ~ y,
A z 2 4 B 1 2 3 4
B x 1 3
value.var = c("a", "b")) ROW IDS
B z 2 4
dt[, c := 1:.N, by = b] – within groups, compute a
Combine data.tables
a b a b c
Reshape a data.table from long to wide format. 1 a 1 a 1 column with sequential row IDs.
2 a 2 a 2
dt A data.table. 3 b 3 b 1
JOIN id ~ y Formula with a LHS: ID columns containing IDs for
multiple entries. And a RHS: columns with values to
LAG & LEAD
a b x y a b x dt_a[dt_b, on = .(b = y)] – join spread in column headers.
1 c 3 b 3 b 3 data.tables on rows with equal values. value.var Columns containing values to fill into cells. dt[, c := shift(a, 1), by = b] – within groups,
2 a + 2 c = 1 c 2
a
1
b
a
a
1
b
a
c
NA duplicate a column with rows lagged by
3 b 1 a 2 a 1 2 a 2 a 1 specified amount.
RESHAPE TO LONG FORMAT 3 b 3 b NA
4 b 4 b 3
a b c x y z a b c x dt_a[dt_b, on = .(b = y, c > z)] – id a_x a_z b_x b_z id y a b melt(dt, 5 b 5 b 4 dt[, c := shift(a, 1, type = "lead"), by = b] –
1 c 7 3 b 4 3 b 4 3 join data.tables on rows with within groups, duplicate a column with rows
+ = id.vars = c("id"),
A 1 2 3 4 A 1 1 3
2 a 5 2 c 5 1 c 5 2 equal and unequal values. B 1 2 3 4 B 1 1 3 leading by specified amount.
3 b 6 1 a 8 NA a 8 1 A 2 2 4 measure.vars = patterns("^a", "^b"),
B 2 2 4 variable.name = "y",
value.name = c("a", "b"))
ROLLING JOIN read & write files
a id date b id date a id date b
Reshape a data.table from wide to long format.
1 A 01-01-2010 + 1 A 01-01-2013 = 2 A 01-01-2013 1 dt A data.table. IMPORT
2 A 01-01-2012 1 B 01-01-2013 2 B 01-01-2013 1 id.vars ID columns with IDs for multiple entries.
3 A 01-01-2014 measure.vars Columns containing values to fill into cells (often in fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
1 B 01-01-2010
pattern form).
2 B 01-01-2012
variable.name, Names of new columns for variables and values fread("file.csv", select = c("a", "b")) – read specified columns from a
value.name derived from old headers. flat file into R.
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join
data.tables on matching rows in id columns but only keep the most
recent preceding match with the left data.table according to date
columns. “roll = -Inf” reverses direction. EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.

CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
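A minimal end-to-end sketch of the dt[i, j, by] pattern together with dcast() and melt(); the table, column names, and values here are invented for illustration only:

library(data.table)
dt <- data.table(id = c("A", "A", "B", "B"),
                 y  = c("x", "z", "x", "z"),
                 a  = 1:4, b = 5:8)
dt[a > 1, .(mean_b = mean(b)), by = id]               # subset with i, summarize with j, group with by
wide <- dcast(dt, id ~ y, value.var = c("a", "b"))    # long -> wide: columns a_x, a_z, b_x, b_z
melt(wide, id.vars = "id",
     measure.vars = patterns("^a", "^b"),
     variable.name = "y", value.name = c("a", "b"))   # wide -> long again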
Machine Learning Modelling in R : : CHEAT SHEET (mlr)
Supervised & unsupervised learning, meta-algorithms, time series, and model validation.
CC BY SA Arnaud Amsellem • thertrader@gmail.com • www.thertrader.com • Updated: 2018-03

Standard Modelling Workflow: Setup -> Training & Testing -> Refining Performance.

Setup (preprocessing and tasks)
createDummyFeatures(obj=,target=,method=,cols=)   Create dummy features for the columns in cols, excluding the target.
normalizeFeatures(obj=,target=,method=,cols=,range=,on.constant=)   Normalize features; method is "center", "scale", "standardize", or "range" (with range=c(0,1)).
mergeSmallFactorLevels(task=,cols=,min.perc=)   Merge infrequent factor levels.
summarizeColumns(obj=)   Summarize the columns of obj. See also capLargeValues, dropFeatures, removeConstantFeatures, summarizeLevels.
Task constructors: makeClassifTask(data=,target=) (set positive= for the positive class), makeRegrTask(data=,target=), makeMultilabelTask(data=,target=), makeClusterTask(data=), makeSurvTask(data=,target=c("time","event")), makeCostSensTask(data=,costs=). Tasks also accept weights= and blocking=.

Learners
makeLearner(cl=,predict.type=,...,par.vals=)   e.g. cl="classif.xgboost", "regr.randomForest", "cluster.kmeans"; predict.type is "response", "prob", or "se"; pass hyperparameters via ... or par.vals=.
makeLearners()   Create several learners at once.
Discover learners with View(listLearners()), View(listLearners(task)), or View(listLearners("classif", properties=c("prob","factors"))); inspect a learner with getLearnerProperties().

Training & Testing
train(learner=,task=)   Fit a model; returns a WrappedModel. getLearnerModel() extracts the underlying fitted model.
predict(object=,task=,newdata=)   Predict on a task or new data; inspect the result pred with View(pred) or as.data.frame(pred).
performance(pred=,measures=)   Compute performance measures; listMeasures() lists what is available:
  classification: acc auc bac ber brier[.scaled] f1 fdr fn fnr fp fpr gmean multiclass[.au1u .aunp .aunu .brier] npv ppv qsr ssr tn tnr tp tpr wkappa
  regression: arsq expvar kendalltau mae mape medae medse mse msle rae rmse rmsle rrse rsq sae spearmanrho sse
  clustering: db dunn G1 G2 silhouette · multilabel: multilabel[.f1 .subset01 .tpr .ppv .acc .hamloss] · cost-sensitive: mcp meancosts · survival: cindex · general: featperc timeboth timepredict timetrain
calculateConfusionMatrix(pred=) and calculateROCMeasures(pred=)   Classification diagnostics.
makeResampleDesc(method=,...,stratify=)   method is "CV" (iters=), "LOO", "RepCV" (reps=, folds=), "Subsample" (iters=, split=), "Bootstrap" (iters=), or "Holdout" (split=); makeResampleInstance(desc=,task=) fixes the splits for reuse.
resample(learner=,task=,resampling=,measures=)   Run the resampling loop. Predefined descriptions cv2, cv3, cv5, cv10, hout and the shortcuts crossval(), repcv(), holdout(), subsample(), bootstrapOOB(), bootstrapB632(), bootstrapB632plus() call resample().

Refining Performance (tuning)
setHyperPars(learner=,...)   Set hyperparameters; getParamSet(learner=) lists them, e.g. getParamSet("classif.qda").
makeParamSet(make<type>Param())   with makeNumericParam(id=,lower=,upper=,trafo=), makeIntegerParam(id=,lower=,upper=,trafo=), makeIntegerVectorParam(id=,len=,lower=,upper=,trafo=), makeDiscreteParam(id=,values=c(...)), plus Logical, LogicalVector, CharacterVector, and DiscreteVector variants. trafo transforms the searched scale, e.g. lower=-2, upper=2, trafo=function(x) 10^x.
makeTuneControl<type>()   Grid(resolution=10L), Random(maxit=100), MBO(budget=), Irace(n.instances=), CMAES, Design, GenSA.
tuneParams(learner=,task=,resampling=,measures=,par.set=,control=)   Run the search.

Quickstart
library(mlbench)
data(Soybean)
soy = createDummyFeatures(Soybean,target="Class")
tsk = makeClassifTask(data=soy,target="Class")
ho = makeResampleInstance("Holdout",tsk)
tsk.train = subsetTask(tsk,ho$train.inds[[1]])
tsk.test = subsetTask(tsk,ho$test.inds[[1]])
lrn = makeLearner("classif.xgboost",nrounds=10)
cv = makeResampleDesc("CV",iters=5)
res = resample(lrn,tsk.train,cv,acc)
ps = makeParamSet(makeNumericParam("eta",0,1),
                  makeNumericParam("lambda",0,200),
                  makeIntegerParam("max_depth",1,20))
tc = makeTuneControlMBO(budget=100)
tr = tuneParams(lrn,tsk.train,cv5,acc,ps,tc)
lrn = setHyperPars(lrn,par.vals=tr$x)
mdl = train(lrn,tsk.train)
prd = predict(mdl,tsk.test)
calculateConfusionMatrix(prd)
mdl = train(lrn,tsk)

Configuration
configureMlr() options: show.info, on.learner.error ("stop", "warn", "quiet"; default "stop"), on.learner.warning ("warn", "quiet"), on.par.without.desc ("stop", "warn", "quiet"), on.par.out.of.bounds ("stop", "warn", "quiet"), on.measure.not.applicable ("stop", "warn", "quiet"), show.learner.output, on.error.dump. getMlrOptions() shows the current settings.

Parallelization
mlr parallelizes through parallelMap. parallelStart(mode=,cpus=,level=) with mode "local" (mapply), "multicore" (parallel::mclapply), "socket" or "mpi" (parallel::makeCluster, parallel::clusterMap), "BatchJobs" (BatchJobs::batchMap); level is one of "mlr.benchmark", "mlr.resample", "mlr.selectFeatures", "mlr.tuneParams", "mlr.ensemble". Finish with parallelStop().

Imputation
impute(obj=,target=,cols=,dummy.cols=,dummy.type=)   e.g. cols=list(V1=imputeMean()) imputes V1 with its mean; classes and dummy.classes apply methods by column class (e.g. "numeric"). Methods: imputeConst(const=), imputeMedian(), imputeMode(), imputeMin(multiplier=), imputeMax(multiplier=), imputeNormal(mean=,sd=), imputeHist(breaks=,use.mids=), imputeLearner(learner=,features=). reimpute(obj=,desc=) applies the impute description desc to new data.

Feature Extraction
filterFeatures(task=,method=,perc=,abs=,threshold=)   Keep the top features by a filter score (perc=, abs=, or threshold=); methods include "randomForestSRC.rfsrc", "anova.test", "carscore", "cforest.importance", "chi.squared", "gain.ratio", "information.gain", "kruskal.test", "linear.correlation", "mrmr", "oneR", "permutation.importance", "randomForest.importance", "randomForestSRC.var.select", "rank.correlation", "relief", "symmetrical.uncertainty", "univariate.model.score", "variance". generateFilterValuesData(task=,method=) and plotFilterValues(obj=) inspect filter scores.
selectFeatures(learner=,task=,resampling=,measures=,control=)   Wrapper feature selection; control is one of makeFeatSelControlExhaustive(max.features=), makeFeatSelControlRandom(maxit=,prob=,max.features=), makeFeatSelControlSequential(method=,maxit=,max.features=,alpha=,beta=) with method "sfs", "sbs", "sffs", "sfbs", or makeFeatSelControlGA(maxit=,max.features=,mu=,lambda=,crossover.rate=,mutation.rate=). From a FeatSelResult fsr: tsk = subsetTask(tsk,features=fsr$x).

Visualization
generateThreshVsPerfData(obj=,measures=) with plotThreshVsPerf(obj) and plotROCCurves(obj) (use measures=list(fpr,tpr)); plotResiduals(obj=) for Prediction or BenchmarkResult objects; generateLearningCurveData(learners=,task=,resampling=,percs=,measures=) with plotLearningCurve(obj=); generateHyperParsEffectData(tune.result=) with plotHyperParsEffect(hyperpars.effect.data=,x=,y=,z=); plotOptPath(op=) on <obj>$opt.path of tuneResult or featSelResult objects; plotTuneMultiCritResult(res=); generatePartialDependenceData(obj=,input=) with plotPartialDependence(obj=).

Wrappers
Wrappers fuse a learner with extra behaviour and can be nested (Learner -> Wrapper 1 -> Wrapper 2 -> Wrapper 3, etc.): makeDummyFeaturesWrapper(learner=), makeImputeWrapper(learner=,classes=,cols=), makePreprocWrapper(learner=,train=,predict=), makePreprocWrapperCaret(learner=,...), makeRemoveConstantFeaturesWrapper(learner=), makeOverBaggingWrapper(learner=), makeSMOTEWrapper(learner=), makeUndersampleWrapper(learner=), makeWeightedClassesWrapper(learner=), makeCostSensClassifWrapper(learner=), makeCostSensRegrWrapper(learner=), makeCostSensWeightedPairsWrapper(learner=), makeMultilabelBinaryRelevanceWrapper(learner=), makeMultilabelClassifierChainsWrapper(learner=), makeMultilabelDBRWrapper(learner=), makeMultilabelNestedStackingWrapper(learner=), makeMultilabelStackingWrapper(learner=), makeBaggingWrapper(learner=), makeConstantClassWrapper(learner=), makeDownsampleWrapper(learner=,dw.perc=), makeFeatSelWrapper(learner=,resampling=,control=), makeFilterWrapper(learner=,fw.perc=,fw.abs=,fw.threshold=), makeMultiClassWrapper(learner=), makeTuneWrapper(learner=,resampling=,par.set=,control=).

Nested Resampling
Combine resample() or benchmark() with makeTuneWrapper() or makeFeatSelWrapper() so that tuning or feature selection is repeated inside every outer resampling fold.

Benchmarking
benchmark(learners=,tasks=,resamplings=,measures=)   Compare learners across tasks. Plot with plotBMRBoxplots(bmr=), plotBMRSummary(bmr=), plotBMRRanksAsBarChart(bmr=). Extract results with the getBMR<object> accessors: AggrPerformance, FeatSelResults, FilteredFeatures, LearnerIds, LearnerShortNames, Learners, MeasureIds, Measures, Models, Performances, Predictions, TaskDescs, TaskIds, TuneResults. generateCritDifferencesData(bmr=,measure=,p.value=,test=) with test "bd" or "Nemenyi" and plotCritDifferences(obj=); generateCalibrationData(obj=) with plotCalibration(obj=). Example tasks shipped with mlr: agri.task, bc.task, bh.task, costiris.task, iris.task, lung.task, mtcars.task, pid.task, sonar.task, wpbc.task, yeast.task.

Ensembles
makeStackedLearner(base.learners=,super.learner=,method=) with method "average", "stack.nocv", "stack.cv", "hill.climb", or "compress".
R For Data Science Cheat Sheet: xts

eXtensible Time Series (xts) is a powerful package that provides an extensible time series class, enabling uniform handling of many R time series classes by extending zoo. Load the package as follows:
> library(xts)

xts Objects — xts objects have three main components:
- coredata: always a matrix for xts objects, while it could also be a vector for zoo objects
- index: vector of any Date, POSIXct, chron, yearmon, yearqtr, or DateTime classes
- xtsAttributes: arbitrary attributes

Creating xts Objects
> xts1 <- xts(x=1:10, order.by=Sys.Date()-1:10)
> data <- rnorm(5)
> dates <- seq(as.Date("2017-05-01"),length=5,by="days")
> xts2 <- xts(x=data, order.by=dates)
> xts3 <- xts(x=rnorm(10), order.by=as.POSIXct(Sys.Date()+1:10), born=as.POSIXct("1899-05-08"))
> xts4 <- xts(x=1:10, order.by=Sys.Date()+1:10)

Convert To And From xts
> data(AirPassengers)
> xts5 <- as.xts(AirPassengers)

Import From Files
> dat <- read.csv(tmp_file)
> xts(dat, order.by=as.Date(rownames(dat),"%m/%d/%Y"))
> dat_zoo <- read.zoo(tmp_file, index.column=0, sep=",", format="%m/%d/%Y")
> dat_zoo <- read.zoo(tmp,sep=",",FUN=as.yearmon)
> dat_xts <- as.xts(dat_zoo)

Export xts Objects
> data_xts <- as.xts(matrix)
> tmp <- tempfile()
> write.zoo(data_xts,sep=",",file=tmp)

Inspect Your Data
> core_data <- coredata(xts2)               Extract core data of objects
> index(xts1)                               Extract index of objects

Class Attributes
> indexClass(xts2)                          Get index class
> indexClass(convertIndex(xts,'POSIXct'))   Replace index class
> indexTZ(xts5)                             Get index time zone
> indexFormat(xts5) <- "%Y-%m-%d"           Change format of time display

Time Zones
> tzone(xts1) <- "Asia/Hong_Kong"           Change the time zone
> tzone(xts1)                               Extract the current time zone

Missing Values
> na.omit(xts5)                             Omit NA values in xts5
> xts_last <- na.locf(xts2)                 Fill missing values in xts2 using last observation
> xts_last <- na.locf(xts2, fromLast=TRUE)  Fill missing values in xts2 using next observation
> na.approx(xts2)                           Interpolate NAs using linear approximation

Replace & Update
> xts2[dates] <- 0                          Replace values in xts2 on dates with 0
> xts5["1961"] <- NA                        Replace dates from 1961 with NA
> xts2["2016-05-02"] <- NA                  Replace the value at one specific index with NA

Applying Functions
> ep1 <- endpoints(xts4,on="weeks",k=2)     Take index values by time
[1] 0 5 10
> ep2 <- endpoints(xts5,on="years")
[1] 0 12 24 36 48 60 72 84 96 108 120 132 144
> period.apply(xts5,INDEX=ep2,FUN=mean)     Calculate the yearly mean
> xts5_yearly <- split(xts5,f="years")      Split xts5 by year
> lapply(xts5_yearly,FUN=mean)              Create a list of yearly means
> do.call(rbind, lapply(split(xts5,"years"), function(w) last(w,n="1 month")))   Find the last observation in each year in xts5
> do.call(rbind, lapply(split(xts5,"years"), cumsum))                            Calculate cumulative annual passengers
> rollapply(xts5, 3, sd)                    Apply sd to rolling margins of xts5

Selecting, Subsetting & Indexing
Select
> mar55 <- xts5["1955-03"]                  Get value for March 1955
Subset
> xts5_1954 <- xts5["1954"]                 Get all data from 1954
> xts5_janmarch <- xts5["1954/1954-03"]     Extract data from Jan to March '54
> xts5_janmarch <- xts5["/1954-03"]         Get all data until March '54
> xts4[ep1]                                 Subset xts4 using ep1
first() and last()
> first(xts4,'1 week')                      Extract first 1 week
> first(last(xts4,'1 week'),'3 days')       Get first 3 days of the last week of data
Indexing
> xts2[index(xts3)]                         Extract rows with the index of xts3
> days <- c("2017-05-03","2017-05-23")
> xts3[days]                                Extract rows using the vector days
> xts2[as.POSIXct(days,tz="UTC")]           Extract rows using days as POSIXct
> index <- which(.indexwday(xts1)==0|.indexwday(xts1)==6)   Index of weekend days
> xts1[index]                               Extract weekend days of xts1

Arithmetic Operations
Use coredata() or as.numeric() to operate on values rather than aligning on the index:
> xts3 + as.numeric(xts2)                   Addition
> xts3 * as.numeric(xts4)                   Multiplication
> coredata(xts4) - xts3                     Subtraction
> coredata(xts4) / xts3                     Division

Shifting Index Values
> xts5 - lag(xts5)                          Period-over-period differences
> diff(xts5,lag=12,differences=1)           Lagged differences

Reindexing
> xts1 + merge(xts2,index(xts1),fill=0)         Addition
> xts1 - merge(xts2,index(xts1),fill=na.locf)   Subtraction

Merging
> merge(xts2,xts1,join='inner')             Inner join of xts2 and xts1
> merge(xts2,xts1,join='left',fill=0)       Left join of xts2 and xts1, fill empty spots with 0
> rbind(xts1, xts4)                         Combine xts1 and xts4 by rows

Periods, Periodicity & Timestamps
> periodicity(xts5)                         Estimate frequency of observations
> to.yearly(xts5)                           Convert xts5 to yearly OHLC
> to.monthly(xts3)                          Convert xts3 to monthly OHLC
> to.quarterly(xts5)                        Convert xts5 to quarterly OHLC
> to.period(xts5,period="quarters")         Convert to quarterly OHLC
> to.period(xts5,period="years")            Convert to yearly OHLC
> nmonths(xts5)                             Count the months in xts5
> nquarters(xts5)                           Count the quarters in xts5
> nyears(xts5)                              Count the years in xts5
> make.index.unique(xts3,eps=1e-4)          Make index unique
> make.index.unique(xts3,drop=TRUE)         Remove duplicate times
> align.time(xts3,n=3600)                   Round index time to the next n seconds

Other Useful Functions
> .index(xts4)                              Extract raw numeric index of xts4
> .indexwday(xts3)                          Value of week(day), starting on Sunday, in index of xts3
> .indexhour(xts3)                          Value of hour in index of xts3
> start(xts3)                               Extract first observation of xts3
> end(xts4)                                 Extract last observation of xts4
> str(xts3)                                 Display structure of xts3
> time(xts1)                                Extract raw numeric index of xts1
> head(xts2)                                First part of xts2
> tail(xts2)                                Last part of xts2

DataCamp • Learn R for Data Science Interactively
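A minimal end-to-end sketch tying the xts commands together; the series is random and the dates are made up, so the output values are illustrative only:

library(xts)
set.seed(1)
idx <- seq(as.Date("2020-01-01"), by = "day", length.out = 90)
x   <- xts(rnorm(90), order.by = idx)        # daily series
ep  <- endpoints(x, on = "months")           # index positions of month ends
period.apply(x, INDEX = ep, FUN = mean)      # one mean per calendar month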
Time Series Cheat Sheet

Plot Time Series
1. tsplot(x=time, y=data)
2. plot(ts(data, start=start_time, frequency=gap))
3. ts.plot(ts(data, start=start_time, frequency=gap))

Filters
Linear filter: filter()
  filter(data, filter=filter_coefficients, sides=2, method="convolution", circular=F)
Differencing filter: diff()
  diff(data, lag=4, differences=1)

Auto-correlation — use the ACF and PACF to detect the model
(Complete) auto-correlation function: acf()
  acf(data, type='correlation', na.action=na.pass)
Partial auto-correlation function: pacf()
  pacf(data, na.action=na.pass)
  OR: acf(data, type='partial', na.action=na.pass)

Models
Autoregression of order p:   X_t = phi_1 X_{t-1} + phi_2 X_{t-2} + ... + phi_p X_{t-p} + W_t
Moving average of order q:   X_t = Z_t + theta_1 Z_{t-1} + theta_2 Z_{t-2} + ... + theta_q Z_{t-q}
ARMA(p, q):                  X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q}

Parameter Estimation — fit an ARMA time series model to the data
ar(): estimate the parameters of an AR model
  ar(x=data, aic=T, order.max=NULL, method=c("yule-walker", "burg", "ols", "mle", "yw"))
arima(): estimate the parameters of an MA or ARMA model, and build the model
  arima(data, order=c(p, 0, q), method=c('ML'))
AICc(): compare fitted models using AICc
  AICc(fittedModel)

Simulation
Simulation of ARMA(p, q):
  arima.sim(model=list(ar=c(phi_1, ..., phi_p), ma=c(theta_1, ..., theta_q)), n=n)

Forecasting — forecast future observations given a fitted ARMA model
predict(): predict future observations given a fitted ARMA model
  predict(arima_model, number_to_predict)
Plot predicted values and confidence interval:
  fit <- predict(arima_model, number_to_predict)
  ts.plot(data, xlim=c(1, length(data)+number_to_predict), ylim=c(0, max(fit$pred+1.96*fit$se)))
  lines(length(data)+1:number_to_predict, fit$pred)
  OR: autoplot(forecast(arima_model, level=c(95), h=number_to_predict))

RStudio® is a trademark of RStudio, Inc. • CC BY SA Yunjun Xia, Shuyu Huang • yx2569@columbia.edu, sh3967@columbia.edu • Updated: 2019-10
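A minimal worked sketch of the estimation and forecasting steps above using the built-in AirPassengers series; the ARIMA order shown is illustrative rather than tuned:

data(AirPassengers)
y   <- log(AirPassengers)                        # stabilize the variance
fit <- arima(y, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc  <- predict(fit, n.ahead = 12)                # point forecasts and standard errors
ts.plot(y, fc$pred, lty = c(1, 2))               # original series plus 12-month forecast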
Deep Learning with Keras : : CHEAT SHEET (Keras, TensorFlow)

Intro
Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. It supports multiple back-ends, including TensorFlow, CNTK and Theano. TensorFlow is a lower-level mathematical library for building deep neural network architectures. The keras R package makes it easy to use Keras and TensorFlow in R.
The typical workflow: Define (sequential or multi-GPU model) -> Compile (optimiser, loss, metrics) -> Fit (batch size, epochs, validation split) -> Evaluate (evaluate, plot) -> Predict (classes, probability).
See https://keras.rstudio.com and https://www.manning.com/books/deep-learning-with-r

INSTALLATION
The keras R package uses the Python keras library. You can install all the prerequisites directly from R; see https://keras.rstudio.com/reference/install_keras.html and ?install_keras for GPU instructions.
library(keras)
install_keras()
This installs the required libraries in an Anaconda environment or virtual environment 'r-tensorflow'.

Working with keras models
DEFINE A MODEL
keras_model()   Keras Model
keras_model_sequential()   Keras Model composed of a linear stack of layers
multi_gpu_model()   Replicates a model on different GPUs
COMPILE A MODEL
compile(object, optimizer, loss, metrics = NULL)   Configure a Keras model for training
FIT A MODEL
fit(object, x = NULL, y = NULL, batch_size = NULL, epochs = 10, verbose = 1, callbacks = NULL, …)   Train a Keras model for a fixed number of epochs (iterations)
fit_generator()   Fits the model on data yielded batch-by-batch by a generator
train_on_batch(); test_on_batch()   Single gradient update or model evaluation over one batch of samples
EVALUATE A MODEL
evaluate(object, x = NULL, y = NULL, batch_size = NULL)   Evaluate a Keras model
evaluate_generator()   Evaluates the model on a data generator
PREDICT
predict()   Generate predictions from a Keras model
predict_proba(); predict_classes()   Generate probability or class predictions for the input samples
predict_on_batch()   Returns predictions for a single batch of samples
predict_generator()   Generates predictions for the input samples from a data generator
OTHER MODEL OPERATIONS
summary()   Print a summary of a Keras model
export_savedmodel()   Export a saved model
get_layer()   Retrieve a layer based on either its name (unique) or index
pop_layer()   Remove the last layer in a model
save_model_hdf5(); load_model_hdf5()   Save/load models using HDF5 files
serialize_model(); unserialize_model()   Serialize a model to an R object
clone_model()   Clone a model instance
freeze_weights(); unfreeze_weights()   Freeze and unfreeze weights
CORE LAYERS
layer_input()   Input layer
layer_dense()   Add a densely-connected NN layer to an output
layer_activation()   Apply an activation function to an output
layer_dropout()   Applies dropout to the input
layer_reshape()   Reshapes an output to a certain shape
layer_permute()   Permute the dimensions of an input according to a given pattern
layer_repeat_vector()   Repeats the input n times
layer_lambda(object, f)   Wraps arbitrary expression as a layer
layer_activity_regularization()   Layer that applies an update to the cost function based on input activity
layer_masking()   Masks a sequence by using a mask value to skip timesteps
layer_flatten()   Flattens an input

The "Hello, World!" of deep learning — TRAINING AN IMAGE RECOGNIZER ON MNIST DATA
# input layer: use MNIST images
mnist <- dataset_mnist()
x_train <- mnist$train$x; y_train <- mnist$train$y
x_test <- mnist$test$x; y_test <- mnist$test$y
# reshape and rescale
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
x_train <- x_train / 255; x_test <- x_test / 255
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
# defining the model and layers
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')
# compile (define loss and optimizer)
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)
# train (fit)
model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
model %>% predict_classes(x_test)

More layers
CONVOLUTIONAL LAYERS: layer_conv_1d() 1D, e.g. temporal convolution; layer_conv_2d() 2D, e.g. spatial convolution over images; layer_conv_2d_transpose() transposed 2D (deconvolution); layer_conv_3d() 3D, e.g. spatial convolution over volumes; layer_conv_3d_transpose() transposed 3D (deconvolution); layer_conv_lstm_2d() convolutional LSTM; layer_separable_conv_2d() depthwise separable 2D; layer_upsampling_1d/2d/3d() upsampling layers; layer_zero_padding_1d/2d/3d() zero-padding layers; layer_cropping_1d/2d/3d() cropping layers.
POOLING LAYERS: layer_max_pooling_1d/2d/3d() maximum pooling for 1D to 3D; layer_average_pooling_1d/2d/3d() average pooling for 1D to 3D; layer_global_max_pooling_1d/2d/3d() global maximum pooling; layer_global_average_pooling_1d/2d/3d() global average pooling.
ACTIVATION LAYERS: layer_activation(object, activation) apply an activation function to an output; layer_activation_leaky_relu() leaky version of a rectified linear unit; layer_activation_parametric_relu() parametric rectified linear unit; layer_activation_thresholded_relu() thresholded rectified linear unit; layer_activation_elu() exponential linear unit.
DROPOUT LAYERS: layer_dropout() applies dropout to the input; layer_spatial_dropout_1d/2d/3d() spatial 1D to 3D versions of dropout.
RECURRENT LAYERS: layer_simple_rnn() fully-connected RNN where the output is to be fed back to input; layer_gru() gated recurrent unit (Cho et al.); layer_cudnn_gru() fast GRU implementation backed by CuDNN; layer_lstm() Long Short-Term Memory unit (Hochreiter 1997); layer_cudnn_lstm() fast LSTM implementation backed by CuDNN.
LOCALLY CONNECTED LAYERS: layer_locally_connected_1d(); layer_locally_connected_2d() similar to convolution, but weights are not shared, i.e. different filters for each patch.

Preprocessing
SEQUENCE PREPROCESSING: pad_sequences() pads each sequence to the same length (length of the longest sequence); skipgrams() generates skipgram word pairs; make_sampling_table() generates a word rank-based probabilistic sampling table.
TEXT PREPROCESSING: text_tokenizer() text tokenization utility; fit_text_tokenizer() update tokenizer internal vocabulary; save_text_tokenizer(); load_text_tokenizer() save/load a text tokenizer to/from an external file; texts_to_sequences(); texts_to_sequences_generator() transform each text to a sequence of integers; texts_to_matrix(); sequences_to_matrix() convert a list of texts or sequences into a matrix; text_one_hot() one-hot encode text to word indices; text_hashing_trick() converts a text to a sequence of indexes in a fixed-size hashing space; text_to_word_sequence() convert text to a sequence of words (or tokens).
IMAGE PREPROCESSING: image_load() loads an image into PIL format; flow_images_from_data(); flow_images_from_directory() generate batches of augmented/normalized data from images and labels, or a directory; image_data_generator() generate minibatches of image data with real-time data augmentation; fit_image_data_generator() fit image data generator internal statistics to some sample data; generator_next() retrieve the next item; image_to_array(); image_array_resize(); image_array_save() 3D array representation of images; imagenet_preprocess_input(); imagenet_decode_predictions() preprocess a tensor encoding a batch of images for ImageNet, and decode predictions.

Pre-trained models
Keras applications are deep learning models that are made available alongside pre-trained weights. These models can be used for prediction, feature extraction, and fine-tuning.
application_xception(); xception_preprocess_input()   Xception v1 model
application_inception_v3(); inception_v3_preprocess_input()   Inception v3 model, with weights pre-trained on ImageNet
application_inception_resnet_v2(); inception_resnet_v2_preprocess_input()   Inception-ResNet v2 model, with weights trained on ImageNet
application_vgg16(); application_vgg19()   VGG16 and VGG19 models
application_resnet50()   ResNet50 model
application_mobilenet(); mobilenet_preprocess_input(); mobilenet_decode_predictions(); mobilenet_load_model_hdf5()   MobileNet model architecture
ImageNet is a large database of images with labels, extensively used for deep learning.

Callbacks
A callback is a set of functions to be applied at given stages of the training procedure. You can use callbacks to get a view on internal states and statistics of the model during training.
callback_early_stopping()   Stop training when a monitored quantity has stopped improving
callback_learning_rate_scheduler()   Learning rate scheduler
callback_tensorboard()   TensorBoard basic visualizations

RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at keras.rstudio.com • keras 2.1.2 • Updated: 2017-12
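A minimal sketch of plugging the callbacks listed above into fit(); it assumes the model, x_train, and y_train objects from the MNIST example, and the log directory name is arbitrary:

model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2,
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 3),  # stop once val_loss stalls
    callback_tensorboard(log_dir = "logs/run1")                   # write TensorBoard logs
  )
)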
Need to Know
Pattern arguments in stringr are interpreted as regular expressions after any special characters have been parsed.
In R, you write regular expressions as strings, sequences of characters surrounded by quotes ("") or single quotes ('').
Some characters cannot be represented directly in an R string. These must be represented as special characters, sequences of characters that have a specific meaning, e.g.:
  Special Character   Represents
  \\                  \
  \"                  "
  \n                  new line
Run ?"'" to see a complete list.
Because of this, whenever a \ appears in a regular expression, you must write it as \\ in the string that represents the regular expression. Use writeLines() to see how R views your string after all special characters have been parsed:
> writeLines("\\.")
# \.
> writeLines("\\ is a backslash")
# \ is a backslash

Regular Expressions
Regular expressions, or regexps, are a concise language for describing patterns in strings.

MATCH CHARACTERS   (example helper: see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx))
string (type this)   regexp (to mean this)   matches (which matches this)
a (etc.)             a (etc.)                a (etc.), e.g. see("a")
\\.                  \.                      .            see("\\.")
\\!                  \!                      !            see("\\!")
\\?                  \?                      ?            see("\\?")
\\\\                 \\                      \            see("\\\\")
\\(                  \(                      (            see("\\(")
\\)                  \)                      )            see("\\)")
\\{                  \{                      {            see("\\{")
\\}                  \}                      }            see("\\}")
\\n                  \n                      new line (return)    see("\\n")
\\t                  \t                      tab          see("\\t")
\\s                  \s                      any whitespace (\S for non-whitespace)    see("\\s")
\\d                  \d                      any digit (\D for non-digits)             see("\\d")
\\w                  \w                      any word character (\W for non-word chars)   see("\\w")
\\b                  \b                      word boundaries      see("\\b")
[:digit:]¹           digits (0 1 2 3 4 5 6 7 8 9)                 see("[:digit:]")
[:alpha:]¹           letters                                      see("[:alpha:]")
[:lower:]¹           lowercase letters (a b c ...)                see("[:lower:]")
[:upper:]¹           uppercase letters (A B C ...)                see("[:upper:]")
[:alnum:]¹           letters and numbers                          see("[:alnum:]")
[:punct:]¹           punctuation (. , : ; ? ! / * @ # | ` = + ^ - _ " ' [ ] { } ( ) ~ < > $)   see("[:punct:]")
[:graph:]¹           letters, numbers, and punctuation            see("[:graph:]")
[:space:]¹           space characters, i.e. \s (space, tab, new line)   see("[:space:]")
[:blank:]¹           space and tab (but not new line)             see("[:blank:]")
.                    every character except a new line            see(".")
¹ Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]

INTERPRETATION
Patterns in stringr are interpreted as regexps. To change this default, wrap the pattern in one of:
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...)   Modifies a regex to ignore cases, match end of lines as well as end of strings, allow R comments within regexps, and/or have . match everything including \n.   str_detect("I", regex("i", TRUE))
fixed()   Matches raw bytes but will miss some characters that can be represented in multiple ways (fast).   str_detect("\u0130", fixed("i"))
coll()   Matches raw bytes and will use locale-specific collation rules to recognize characters that can be represented in multiple ways (slow).   str_detect("\u0130", coll("i", TRUE, locale = "tr"))
boundary()   Matches boundaries between characters, line_breaks, sentences, or words.   str_split(sentences, boundary("word"))

ALTERNATES   (alt <- function(rx) str_view_all("abcde", rx))
ab|d      or             alt("ab|d")
[abe]     one of         alt("[abe]")
[^abe]    anything but   alt("[^abe]")
[a-c]     range          alt("[a-c]")

ANCHORS   (anchor <- function(rx) str_view_all("aaa", rx))
^a        start of string   anchor("^a")
a$        end of string     anchor("a$")

LOOK AROUNDS   (look <- function(rx) str_view_all("bacad", rx))
a(?=c)    followed by       look("a(?=c)")
a(?!c)    not followed by   look("a(?!c)")
(?<=b)a   preceded by       look("(?<=b)a")
(?<!b)a   not preceded by   look("(?<!b)a")

QUANTIFIERS   (quant <- function(rx) str_view_all(".a.aa.aaa", rx))
a?        zero or one        quant("a?")
a*        zero or more       quant("a*")
a+        one or more        quant("a+")
a{n}      exactly n          quant("a{2}")
a{n, }    n or more          quant("a{2,}")
a{n, m}   between n and m    quant("a{2,4}")

GROUPS   (ref <- function(rx) str_view_all("abbaab", rx))
Use parentheses to set precedence (order of evaluation) and create groups:
(ab|d)e   sets precedence    alt("(ab|d)e")
Use an escaped number to refer to and duplicate parentheses groups that occur earlier in a pattern. Refer to each group by its order of appearance:
\\1 (type this) / \1 (to mean this)   first () group, etc.   ref("(a)(b)\\2\\1") matches "abba" (the result is the same as ref("abba"))

RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor on Twitter • stringr 1.4.0+ • Updated: 2021-08
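A minimal sketch applying these patterns with stringr's detection, extraction, and back-reference replacement functions; the input strings are made up:

library(stringr)
x <- c("order 154", "no id here", "order 29")
str_detect(x, "\\d+")                             # TRUE FALSE TRUE
str_extract(x, "\\d+")                            # "154" NA "29"
str_replace(x, "(order) (\\d+)", "\\2-\\1")       # swap the groups: "154-order", ..., "29-order"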
Data visualization with ggplot2 : : CHEAT SHEET

Basics
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms, visual marks that represent data points.
To display values, map variables in the data to visual properties of the geom (aesthetics) like size, color, and x and y locations.
Complete the template below to build a graph:
  ggplot(data = <DATA>) +
    <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
                    stat = <STAT>, position = <POSITION>) +
    <COORDINATE_FUNCTION> +
    <FACET_FUNCTION> +
    <SCALE_FUNCTION> +
    <THEME_FUNCTION>
Only the data, a geom function, and the aesthetic mappings are required; the remaining components have sensible defaults.
ggplot(data = mpg, aes(x = cty, y = hwy))   Begins a plot that you finish by adding layers to. Add one geom function per layer.
last_plot()   Returns the last plot.
ggsave("plot.png", width = 5, height = 5)   Saves the last plot as a 5' x 5' file named "plot.png" in the working directory. Matches file type to file extension.

Aes — common aesthetic values
color and fill: string ("red", "#RRGGBB")
linetype: integer or string (0 = "blank", 1 = "solid", 2 = "dashed", 3 = "dotted", 4 = "dotdash", 5 = "longdash", 6 = "twodash")
lineend: string ("round", "butt", or "square")
linejoin: string ("round", "mitre", or "bevel")
size: integer (line width in mm)
shape: integer/shape name or a single character ("a")

Geoms — Use a geom function to represent data points, use the geom's aesthetic properties to represent variables. Each function returns a layer.

GRAPHICAL PRIMITIVES
a <- ggplot(economics, aes(date, unemploy))
b <- ggplot(seals, aes(x = long, y = lat))
a + geom_blank() and a + expand_limits()   Ensure limits include values across all plots.
b + geom_curve(aes(yend = lat + 1, xend = long + 1), curvature = 1)   x, xend, y, yend, alpha, angle, color, curvature, linetype, size
a + geom_path(lineend = "butt", linejoin = "round", linemitre = 1)   x, y, alpha, color, group, linetype, size
a + geom_polygon(aes(alpha = 50))   x, y, alpha, color, fill, group, subgroup, linetype, size
b + geom_rect(aes(xmin = long, ymin = lat, xmax = long + 1, ymax = lat + 1))   xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size
a + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900))   x, ymax, ymin, alpha, color, fill, group, linetype, size

LINE SEGMENTS (common aesthetics: x, y, alpha, color, linetype, size)
b + geom_abline(aes(intercept = 0, slope = 1))
b + geom_hline(aes(yintercept = lat))
b + geom_vline(aes(xintercept = long))
b + geom_segment(aes(yend = lat + 1, xend = long + 1))
b + geom_spoke(aes(angle = 1:1155, radius = 1))

ONE VARIABLE — continuous
c <- ggplot(mpg, aes(hwy)); c2 <- ggplot(mpg)
c + geom_area(stat = "bin")   x, y, alpha, color, fill, linetype, size
c + geom_density(kernel = "gaussian")   x, y, alpha, color, fill, group, linetype, size, weight
c + geom_dotplot()   x, y, alpha, color, fill
c + geom_freqpoly()   x, y, alpha, color, group, linetype, size
c + geom_histogram(binwidth = 5)   x, y, alpha, color, fill, linetype, size, weight
c2 + geom_qq(aes(sample = hwy))   x, y, alpha, color, fill, linetype, size, weight

ONE VARIABLE — discrete
d <- ggplot(mpg, aes(fl))
d + geom_bar()   x, alpha, color, fill, linetype, size, weight

TWO VARIABLES — both continuous
e <- ggplot(mpg, aes(cty, hwy))
e + geom_label(aes(label = cty), nudge_x = 1, nudge_y = 1)   x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust
e + geom_point()   x, y, alpha, color, fill, shape, size, stroke
e + geom_quantile()   x, y, alpha, color, group, linetype, size, weight
e + geom_rug(sides = "bl")   x, y, alpha, color, linetype, size
e + geom_smooth(method = lm)   x, y, alpha, color, fill, group, linetype, size, weight
e + geom_text(aes(label = cty), nudge_x = 1, nudge_y = 1)   x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

TWO VARIABLES — one discrete, one continuous
f <- ggplot(mpg, aes(class, hwy))
f + geom_col()   x, y, alpha, color, fill, group, linetype, size
f + geom_boxplot()   x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight
f + geom_dotplot(binaxis = "y", stackdir = "center")   x, y, alpha, color, fill, group
f + geom_violin(scale = "area")   x, y, alpha, color, fill, group, linetype, size, weight

TWO VARIABLES — both discrete
g <- ggplot(diamonds, aes(cut, color))
g + geom_count()   x, y, alpha, color, fill, shape, size, stroke
e + geom_jitter(height = 2, width = 2)   x, y, alpha, color, fill, shape, size

TWO VARIABLES — continuous bivariate distribution
h <- ggplot(diamonds, aes(carat, price))
h + geom_bin2d(binwidth = c(0.25, 500))   x, y, alpha, color, fill, linetype, size, weight
h + geom_density_2d()   x, y, alpha, color, group, linetype, size
h + geom_hex()   x, y, alpha, color, fill, size

TWO VARIABLES — continuous function
i <- ggplot(economics, aes(date, unemploy))
i + geom_area()   x, y, alpha, color, fill, linetype, size
i + geom_line()   x, y, alpha, color, group, linetype, size
i + geom_step(direction = "hv")   x, y, alpha, color, group, linetype, size

TWO VARIABLES — visualizing error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
j <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))
j + geom_crossbar(fatten = 2)   x, y, ymax, ymin, alpha, color, fill, group, linetype, size
j + geom_errorbar()   x, ymax, ymin, alpha, color, group, linetype, size, width. Also geom_errorbarh().
j + geom_linerange()   x, ymin, ymax, alpha, color, group, linetype, size
j + geom_pointrange()   x, y, ymin, ymax, alpha, color, fill, group, linetype, shape, size

TWO VARIABLES — maps
data <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- map_data("state")
k <- ggplot(data, aes(fill = murder))
k + geom_map(aes(map_id = state), map = map) + expand_limits(x = map$long, y = map$lat)   map_id, alpha, color, fill, linetype, size

THREE VARIABLES
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2)); l <- ggplot(seals, aes(long, lat))
l + geom_contour(aes(z = z))   x, y, z, alpha, color, group, linetype, size, weight
l + geom_contour_filled(aes(fill = z))   x, y, alpha, color, fill, group, linetype, size, subgroup
l + geom_raster(aes(fill = z), hjust = 0.5, vjust = 0.5, interpolate = FALSE)   x, y, alpha, fill
l + geom_tile(aes(fill = z))   x, y, alpha, color, fill, linetype, size, width

RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
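A minimal sketch of the layered template above, using the built-in mpg data and two geoms; the aesthetic choices are arbitrary:

library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +          # layer 1: points colored by vehicle class
  geom_smooth(method = "lm", se = TRUE)     # layer 2: linear trend with a confidence band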
Stats — An alternative way to build a layer
A stat builds new variables to plot (e.g., count, prop). Visualize a stat by changing the default stat of a geom function, geom_bar(stat = "count"), or by using a stat function, stat_count(geom = "bar"), which calls a default geom to make a layer (equivalent to a geom function). Use ..name.. syntax to map stat variables to aesthetics.
i + stat_density_2d(aes(fill = ..level..), geom = "polygon")   (..level.. is the variable created by the stat)
c + stat_bin(binwidth = 1, boundary = 10)   x, y | ..count.., ..ncount.., ..density.., ..ndensity..
c + stat_count(width = 1)   x, y | ..count.., ..prop..
c + stat_density(adjust = 1, kernel = "gaussian")   x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = T)   x, y, fill | ..count.., ..density..
e + stat_bin_hex(bins = 30)   x, y, fill | ..count.., ..density..
e + stat_density_2d(contour = TRUE, n = 100)   x, y, color, size | ..level..
e + stat_ellipse(level = 0.95, segments = 51, type = "t")
l + stat_contour(aes(z = z))   x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max)   x, y, z, fill | ..value..
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean)   x, y, z, fill | ..value..
f + stat_boxplot(coef = 1.5)   x, y | ..lower.., ..middle.., ..upper.., ..width.., ..ymin.., ..ymax..
f + stat_ydensity(kernel = "gaussian", scale = "area")   x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width..
e + stat_ecdf(n = 40)   x, y | ..x.., ..y..
e + stat_quantile(quantiles = c(0.1, 0.9), formula = y ~ log(x), method = "rq")   x, y | ..quantile..
e + stat_smooth(method = "lm", formula = y ~ x, se = T, level = 0.95)   x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + xlim(-5, 5) + stat_function(fun = dnorm, n = 20, geom = "point")   x | ..x.., ..y..
ggplot() + stat_qq(aes(sample = 1:100))   x, y, sample | ..sample.., ..theoretical..
e + stat_sum()   x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun = "mean", geom = "bar")
e + stat_identity()
e + stat_unique()

Scales — Override defaults with the scales package
Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale: scale_<aesthetic>_<prepackaged scale to use>(scale-specific arguments), e.g.
n <- d + geom_bar(aes(fill = fl))
n + scale_fill_manual(
  values = c("skyblue", "royalblue", "blue", "navy"),   # values to include in mapping
  limits = c("d", "e", "p", "r"),                       # range of values to include in legend/axis
  breaks = c("d", "e", "p", "r"),                       # breaks to use in legend/axis
  name = "fuel",                                        # title to use in legend/axis
  labels = c("D", "E", "P", "R"))                       # labels to use in legend/axis
GENERAL PURPOSE SCALES — use with most aesthetics
scale_*_continuous()   Map continuous values to visual ones.
scale_*_discrete()   Map discrete values to visual ones.
scale_*_binned()   Map continuous values to discrete bins.
scale_*_identity()   Use data values as visual ones.
scale_*_manual(values = c())   Map discrete values to manually chosen visual ones.
scale_*_date(date_labels = "%m/%d", date_breaks = "2 weeks")   Treat data values as dates.
scale_*_datetime()   Treat data values as date-times. Same arguments as scale_*_date(). See ?strptime for label formats.
X & Y LOCATION SCALES — use with x or y aesthetics (x shown here)
scale_x_log10()   Plot x on log10 scale.
scale_x_reverse()   Reverse the direction of the x axis.
scale_x_sqrt()   Plot x on square root scale.
COLOR AND FILL SCALES (DISCRETE)
n + scale_fill_brewer(palette = "Blues")   For palette choices: RColorBrewer::display.brewer.all()
n + scale_fill_grey(start = 0.2, end = 0.8, na.value = "red")
COLOR AND FILL SCALES (CONTINUOUS)
o <- c + geom_dotplot(aes(fill = ..x..))
o + scale_fill_distiller(palette = "Blues")
o + scale_fill_gradient(low = "red", high = "yellow")
o + scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 25)
o + scale_fill_gradientn(colors = topo.colors(6))   Also: rainbow(), heat.colors(), terrain.colors(), cm.colors(), RColorBrewer::brewer.pal()
SHAPE AND SIZE SCALES
p <- e + geom_point(aes(shape = fl, size = cyl))
p + scale_shape() + scale_size()
p + scale_shape_manual(values = c(3:7))
p + scale_radius(range = c(1, 6))
p + scale_size_area(max_size = 6)

Coordinate Systems
r <- d + geom_bar()
r + coord_cartesian(xlim = c(0, 5))   xlim, ylim. The default cartesian coordinate system.
r + coord_fixed(ratio = 1/2)   ratio, xlim, ylim. Cartesian coordinates with fixed aspect ratio between x and y units.
ggplot(mpg, aes(y = fl)) + geom_bar()   Flip cartesian coordinates by switching x and y aesthetic mappings.
r + coord_polar(theta = "x", direction = 1)   theta, start, direction. Polar coordinates.
r + coord_trans(y = "sqrt")   x, y, xlim, ylim. Transformed cartesian coordinates. Set xtrans and ytrans to the name of a window function.
coord_quickmap(); coord_map(projection = "ortho", orientation = c(41, -74, 0))   projection, xlim, ylim. Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.).

Position Adjustments
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge")   Arrange elements side by side.
s + geom_bar(position = "fill")   Stack elements on top of one another, normalize height.
e + geom_point(position = "jitter")   Add random noise to X and Y position of each element to avoid overplotting.
e + geom_label(position = "nudge")   Nudge labels away from points.
s + geom_bar(position = "stack")   Stack elements on top of one another.
Each position adjustment can be recast as a function with manual width and height arguments:
s + geom_bar(position = position_dodge(width = 1))

Faceting
Facets divide a plot into subplots based on the values of one or more discrete variables.
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
t + facet_grid(cols = vars(fl))   Facet into columns based on fl.
t + facet_grid(rows = vars(year))   Facet into rows based on year.
t + facet_grid(rows = vars(year), cols = vars(fl))   Facet into both rows and columns.
t + facet_wrap(vars(fl))   Wrap facets into a rectangular layout.
Set scales to let axis limits vary across facets:
t + facet_grid(rows = vars(drv), cols = vars(fl), scales = "free")   x and y axis limits adjust to individual facets; "free_x" (x axis limits adjust), "free_y" (y axis limits adjust).
Set labeller to adjust facet labels:
t + facet_grid(cols = vars(fl), labeller = label_both)
t + facet_grid(rows = vars(fl), labeller = label_bquote(alpha ^ .(fl)))

Labels and Legends
Use labs() to label the elements of your plot:
t + labs(x = "New x axis label", y = "New y axis label",
         title = "Add a title above the plot",
         subtitle = "Add a subtitle below title",
         caption = "Add a caption below plot",
         alt = "Add alt text to the plot",
         <aes> = "New <aes> legend title")
t + annotate(geom = "text", x = 8, y = 9, label = "A")   Places a geom with manually selected aesthetics.
p + guides(x = guide_axis(n.dodge = 2))   Avoid crowded or overlapping labels with guide_axis(n.dodge or angle).
n + guides(fill = "none")   Set legend type for each aesthetic: colorbar, legend, or none (no legend).
n + theme(legend.position = "bottom")   Place legend at "bottom", "top", "left", or "right".
n + scale_fill_discrete(name = "Title", labels = c("A", "B", "C", "D", "E"))   Set legend title and labels with a scale function.

Themes
r + theme_bw()   White background with grid lines.
r + theme_gray()   Grey background (default theme).
r + theme_dark()   Dark for contrast.
r + theme_classic(); r + theme_light(); r + theme_linedraw(); r + theme_minimal() (minimal theme); r + theme_void() (empty theme).
r + theme()   Customize aspects of the theme such as axis, legend, panel, and facet properties, e.g.
  r + ggtitle("Title") + theme(plot.title.position = "plot")
  r + theme(panel.background = element_rect(fill = "blue"))

Zooming
Without clipping (preferred): t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
With clipping (removes unseen data points): t + xlim(0, 100) + ylim(10, 20), or t + scale_x_continuous(limits = c(0, 100)) + scale_y_continuous(limits = c(0, 100))

RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
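A minimal sketch combining a discrete fill scale, faceting, labels, and a theme from the sections above, again using the built-in mpg data; the palette and labels are arbitrary:

library(ggplot2)
ggplot(mpg, aes(class, hwy, fill = drv)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Blues") +    # discrete fill scale
  facet_wrap(vars(year)) +                  # one panel per model year
  labs(title = "Highway mileage by class", fill = "Drive") +
  theme_minimal()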


















RStudio IDE : : CHEAT SHEET


Documents and Apps Source Editor Tab Panes Version
Open Shiny, R Markdown,
knitr, Sweave, LaTeX, .Rd files
Navigate

forwards
Open in new Save Find and
backwards/ window replace
Compile as Run
notebook selected
code
Import data History of past
with wizard commands to
run/copy
Manage
external
View
memory
databases usage
R tutorials
Control
and more in Source Pane Turn on at Tools > Project Options > Git/SVN
A• Added M• Modified
RStudio IDE panes (continued)
- Source pane: check spelling, render output, choose the output format, configure render options, insert a code chunk, and publish to a server. Re-run previous code, run the current chunk (with or without echo, or as a local job), run this and all previous code chunks, jump to the previous or next chunk or to a section, set knitr chunk options, and show the file outline. Open the Visual Editor (see reverse side), use multiple cursors/column selection with Alt + mouse drag, and watch for code diagnostics that appear in the margin (hover over diagnostic symbols for details). Syntax highlighting is based on the file's extension; tab completion finishes function names, file paths, arguments, and more; multi-language code snippets quickly insert common blocks of code. Access the markdown guide at Help > Markdown Quick Reference.
- Git pane: stage files, commit staged files, push/pull to the remote, view history, and see the current branch. File status markers include D (deleted), R (renamed), and ? (untracked); show a file diff to view file differences.
- Environment pane: load, save, or clear the workspace; search inside the environment; choose the environment to display from the list of parent environments; display objects as a list or grid. Saved objects are shown by name and type with a short description; open them in the data viewer or view function source code. View(<data>) opens a spreadsheet-like view of a data set where you can filter rows by value or value range, sort by values, and search for values.
- Files pane: a file browser keyed to your working directory; click on a file or directory name to open it. Create folders, delete or rename files, change the displayed directory, change the file type, and open more file options; the path to the displayed directory is shown. You can also open a shell to type commands, run scripts in separate sessions, maximize or minimize panes, and drag pane boundaries.
- Shiny apps: RStudio recognizes that files named app.R, server.R, ui.R, and global.R belong to a Shiny app. Run the app, choose where to view it, stop the app, and publish to shinyapps.io, rpubs, RSConnect, or a server, or manage publish accounts. An R Markdown build log is also available.
- Plots, Help, and Viewer panes: RStudio opens plots in a dedicated Plots pane (navigate recent plots, open in a window, export, delete one or all plots) and documentation in a dedicated Help pane (home page of helpful links, search within or for a help file). The Viewer pane displays HTML content, such as Shiny apps, R Markdown reports, and interactive visualizations.
- Packages pane: a GUI package manager lists every installed package. Install or update packages, browse the package site, click to load a package with library() and unclick to detach it with detach(), or delete a package from the installed library; the installed version is shown.

Package Development
- Create a new package with File > New Project > New Directory > R Package.
- Enable roxygen documentation with Tools > Project Options > Build Tools (Roxygen guide at Help > Roxygen Quick Reference).
- See package information in the Build tab: install the package and restart R, run devtools::load_all() and reload changes, run R CMD check and rebuild, run package tests, customize package build options, and clear output.

Debug Mode
- Use debug(), browser(), or a breakpoint and execute your code to open debugger mode. Click next to a line number to add or remove a breakpoint; the highlighted line shows where execution has paused.
- Launch the debugger from the origin of an error and open the traceback to examine the functions that R called before the error occurred.
- While paused, run commands in the environment where execution has paused, examine variables in the executing environment, and select a function in the traceback to debug. Step through code one line at a time, step into and out of functions, resume execution, or quit debug mode.
Keyboard Shortcuts (RStudio)

RUN CODE                              Windows/Linux       Mac
Search command history                Ctrl+↑              Cmd+↑
Interrupt current command             Esc                 Esc
Clear console                         Ctrl+L              Ctrl+L

NAVIGATE CODE
Go to File/Function                   Ctrl+.              Ctrl+.
Show Command Palette                  Ctrl+Shift+P        Cmd+Shift+P

WRITE CODE
Attempt completion                    Tab or Ctrl+Space   Tab or Ctrl+Space
Insert <- (assignment operator)       Alt+-               Option+-
Insert %>% (pipe operator)            Ctrl+Shift+M        Cmd+Shift+M
(Un)Comment selection                 Ctrl+Shift+C        Cmd+Shift+C

MAKE PACKAGES
Load All (devtools)                   Ctrl+Shift+L        Cmd+Shift+L
Test Package (Desktop)                Ctrl+Shift+T        Cmd+Shift+T
Document Package                      Ctrl+Shift+D        Cmd+Shift+D

DOCUMENTS AND APPS
Knit Document (knitr)                 Ctrl+Shift+K        Cmd+Shift+K
Insert chunk (Sweave & Knitr)         Ctrl+Alt+I          Cmd+Option+I
Run from start to current line        Ctrl+Alt+B          Cmd+Option+B

MORE KEYBOARD SHORTCUTS
Keyboard Shortcuts Help               Alt+Shift+K         Option+Shift+K

View the Keyboard Shortcut Quick Reference with Tools > Keyboard Shortcuts or Alt/Option + Shift + K. Search for keyboard shortcuts with Tools > Show Command Palette or Ctrl/Cmd + Shift + P.

WHY RSTUDIO WORKBENCH?
Extend the open source server with a commercial license, support, and more:
- open and run multiple R sessions at once
- tune your resources to improve performance
- administrative tools for managing user sessions
- collaborate real-time with others in shared projects
- switch easily from one version of R to a different version
- integrate with your authentication, authorization, and audit practices
- work in the RStudio IDE, JupyterLab, Jupyter Notebooks, or VS Code
Download a free 45 day evaluation at www.rstudio.com/products/workbench/evaluation/

Share Projects
File > New Project. RStudio saves the call history, workspace, and working directory associated with a project and reloads each when you re-open the project. Start a new R session in the current project or close the R session in a project; the name of the current project, the active shared collaborators, and the R version are shown, and you can share the project with collaborators.

Visual Editor
Check spelling, render output, choose the output format and location, insert a code chunk, jump to the previous or next chunk, run selected lines, publish to a server, and show the file outline; switch back to the Source Editor from the front page. Toolbar controls cover block format, lists and links, citations, images, clearing formatting, special characters, inserting blocks, quotes, equations, and verbatim code, editing tables, selecting a block, and adding or editing attributes. Set knitr chunk options, run this and all previous code chunks, run this code chunk, or jump to a chunk or header.

Run Remote Jobs
Run R on remote clusters (Kubernetes/Slurm) via the Job Launcher: launch a job, monitor launcher jobs, and run launcher jobs remotely.
6.3 SQL language
15.003 Software Tools — Data Science          Afshine Amidi & Shervine Amidi

Study Guide: Data Retrieval with SQL

Afshine Amidi and Shervine Amidi

August 21, 2020

General concepts
r Structured Query Language – Structured Query Language, abbreviated as SQL, is a
language that is largely used in the industry to query data from databases.

r Query structure – Queries are usually structured as follows:

SQL
-- Select fields.....................mandatory
SELECT
    col_1,
    col_2,
    ... ,
    col_n
-- Source of data....................mandatory
FROM table t
-- Gather info from other sources....optional
JOIN other_table ot
  ON (t.key = ot.key)
-- Conditions........................optional
WHERE some_condition(s)
-- Aggregating.......................optional
GROUP BY column_group_list
-- Sorting values....................optional
ORDER BY column_order_list
-- Restricting aggregated values.....optional
HAVING some_condition(s)
-- Limiting number of rows...........optional
LIMIT some_value

Remark: the SELECT DISTINCT command can be used to ensure not having duplicate rows.

r Condition – A condition is of the following format:

SQL
some_col some_operator some_col_or_value

where some_operator can be among the following common operations:

Category    Operator                    Command
General     Equality / non-equality     = / !=, <>
            Inequalities                >=, >, <, <=
            Belonging                   IN (val_1, ..., val_n)
            And / or                    AND / OR
            Check for missing value     IS NULL
            Between bounds              BETWEEN val_1 AND val_2
Strings     Pattern matching            LIKE '%val%'

r Joins – Two tables table_1 and table_2 can be joined in the following way:

SQL
...
FROM table_1 t1
type_of_join table_2 t2
  ON (t2.key = t1.key)
...

where the different type_of_join commands are summarized in the table below:

Type of join    Rows kept
INNER JOIN      rows with matching keys in both tables
LEFT JOIN       all rows of table_1, matched where possible
RIGHT JOIN      all rows of table_2, matched where possible
FULL JOIN       all rows of both tables

Remark: joining every row of table 1 with every row of table 2 can be done with the CROSS JOIN
command, and is commonly known as the cartesian product.
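As an illustration of the structure above, here is a minimal sketch assuming two hypothetical tables, orders(order_id, customer_id, amount, order_date) and customers(customer_id, country); the table and column names are made up for the example:

SQL
SELECT
    c.country,
    o.order_id,
    o.amount
FROM orders o
JOIN customers c
  ON (o.customer_id = c.customer_id)
WHERE o.amount BETWEEN 10 AND 500
  AND c.country IN ('US', 'CA')
ORDER BY o.amount DESC
LIMIT 100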


Aggregations
r Grouping data – Aggregate metrics are computed on grouped data. The SQL command is as
follows:

SQL
SELECT
    col_1,
    agg_function(col_2)
FROM table
GROUP BY col_1

r Grouping sets – The GROUPING SETS command is useful when there is a need to compute
aggregations across different dimensions at a time. Below is an example of how all aggregations
across two dimensions are computed:

SQL
SELECT
    col_1,
    col_2,
    agg_function(col_3)
FROM table
GROUP BY (
  GROUPING SETS
    (col_1),
    (col_2),
    (col_1, col_2)
)

r Aggregation functions – The table below summarizes the main aggregate functions that
can be used in an aggregation query:

Category    Operation                         Command
Values      Mean                              AVG(col)
            Percentile                        PERCENTILE_APPROX(col, p)
            Sum / # of instances              SUM(col) / COUNT(col)
            Max / min                         MAX(col) / MIN(col)
            Variance / standard deviation     VAR(col) / STDEV(col)
Arrays      Concatenate into array            collect_list(col)

Remark: the median can be computed using the PERCENTILE_APPROX function with p equal to 0.5.

r Filtering – The table below highlights the differences between the WHERE and HAVING commands:

WHERE                                            HAVING
- Filter condition applies to individual rows    - Filter condition applies to aggregates
- Statement placed right after FROM              - Statement placed right after GROUP BY

Remark: if WHERE and HAVING are both in the same query, WHERE will be executed first.

Window functions
r Definition – A window function computes a metric over groups and has the following structure:

SQL
some_window_function() OVER(PARTITION BY some_col ORDER BY another_col)

Remark: window functions are only allowed in the SELECT clause.

r Row numbering – The table below summarizes the main commands that rank each row
across specified groups, ordered by a specific column:

Command          Description                                         Example
ROW_NUMBER()     Ties are given different ranks                      1, 2, 3, 4
RANK()           Ties are given same rank and skip numbers           1, 2, 2, 4
DENSE_RANK()     Ties are given same rank and don't skip numbers     1, 2, 2, 3

r Values – The following window functions allow to keep track of specific types of values with
respect to the partition:

Command              Description
FIRST_VALUE(col)     Takes the first value of the column
LAST_VALUE(col)      Takes the last value of the column
LAG(col, n)          Takes the nth previous value of the column
LEAD(col, n)         Takes the nth following value of the column
NTH_VALUE(col, n)    Takes the nth value of the column

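A short sketch combining the commands above, again on the hypothetical orders table introduced earlier: the first query aggregates with GROUP BY and filters the aggregates with HAVING, and the second ranks each customer's orders with a window function.

SQL
-- total amount per customer, keeping only customers above 1000
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;

-- rank each order within its customer, most recent first
SELECT
    customer_id,
    order_id,
    ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank
FROM orders;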


Advanced functions
r SQL tips – In order to keep the query in a clear and concise format, the following tricks are
often done:

Operation              Command                                   Description
Renaming columns       SELECT operation_on_column AS col_name    New column names shown in query results
Abbreviating tables    FROM table_1 t1                           Abbreviation used within query for simplicity in notations
Simplifying group by   GROUP BY col_number_list                  Specify column position in SELECT clause instead of whole column names
Limiting results       LIMIT n                                   Display only n rows

r Sorting values – The query results can be sorted along a given set of columns using the
following command:

SQL
... [query] ...
ORDER BY col_list

Remark: by default, the command sorts in ascending order. If we want to sort it in descending
order, the DESC command needs to be used after the column.

r Column types – In order to ensure that a column or value is of one specific data type, the
following command is used:

SQL
CAST(some_col_or_value AS data_type)

where data_type is one of the following:

Data type           Description        Example
INT                 Integer            2
DOUBLE              Numerical value    2.0
STRING / VARCHAR    String             'teddy bear'
DATE                Date               '2020-01-01'
TIMESTAMP           Timestamp          '2020-01-01 00:00:00.000'

Remark: if the column contains data of different types, the TRY_CAST() command will convert
unknown types to NULL instead of throwing an error.

r Column manipulation – The main functions used to manipulate columns are described in
the table below:

Category    Operation                                                   Command
General     Take first non-NULL value                                   COALESCE(col_1, col_2, ..., col_n)
            Create a new column combining existing ones                 CONCAT(col_1, ..., col_n)
Value       Round value to n decimals                                   ROUND(col, n)
String      Convert string column to lower / upper case                 LOWER(col) / UPPER(col)
            Replace occurrences of old in col with new                  REPLACE(col, old, new)
            Take the substring of col, with a given start and length    SUBSTR(col, start, length)
            Remove spaces from the left / right / both sides            LTRIM(col) / RTRIM(col) / TRIM(col)
            Length of the string                                        LENGTH(col)
Date        Truncate at a given granularity (year, month, week)         DATE_TRUNC(time_dimension, col_date)
            Transform date                                              DATE_ADD(col_date, number_of_days)

r Conditional column – A column can take different values with respect to a particular set
of conditions with the CASE WHEN command as follows:

SQL
CASE WHEN some_condition THEN some_value
     ...
     WHEN some_other_condition THEN some_other_value
     ELSE some_other_value_n END

r Combining results – The table below summarizes the main ways to combine results in
queries:

Category        Command      Remarks
Union           UNION        Guarantees distinct rows
                UNION ALL    Potential newly-formed duplicates are kept
Intersection    INTERSECT    Keeps observations that are in all selected queries

r Common table expression – A common way of handling complex queries is to have temporary
result sets coming from intermediary queries, which are called common table expressions
(abbreviated CTE), that increase the readability of the overall query. It is done thanks to the
WITH ... AS ... command as follows:

SQL
WITH cte_1 AS (
  SELECT ...
),
...
cte_n AS (
  SELECT ...
)
SELECT ...
FROM ...
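Tying these pieces together, here is a minimal sketch on the same hypothetical orders table: a CTE computes per-customer totals, and the outer query casts the total and buckets it with CASE WHEN.

SQL
WITH customer_totals AS (
  SELECT
      customer_id,
      SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
)
SELECT
    customer_id,
    CAST(total_amount AS INT) AS total_amount_int,
    CASE WHEN total_amount >= 1000 THEN 'large'
         WHEN total_amount >= 100  THEN 'medium'
         ELSE 'small' END AS customer_bucket
FROM customer_totals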



Table manipulation
r Table creation – The creation of a table is done as follows:

SQL
CREATE [table_type] TABLE [creation_type] table_name(
  col_1 data_type_1,
  ...................,
  col_n data_type_n
)
[options];

where [table_type], [creation_type] and [options] are one of the following:

Category        Command                           Description
Table type      Blank                             Default table
                EXTERNAL TABLE                    External table
Creation type   Blank                             Creates table and overwrites current one if it exists
                IF NOT EXISTS                     Only creates table if it does not exist
Options         location 'path_to_hdfs_folder'    Populate table with data from hdfs folder
                stored as data_format             Stores the table in a specific data format, e.g. parquet, orc or avro

r Data insertion – New data can either append or overwrite already existing data in a given
table as follows:

SQL
WITH ..............................-- optional
INSERT [insert_type] table_name....-- mandatory
SELECT ...;........................-- mandatory

where [insert_type] is among the following:

Command      Description
OVERWRITE    Overwrites existing data
INTO         Appends to existing data

r Dropping table – Tables are dropped in the following way:

SQL
DROP TABLE table_name;

r View – Instead of using a complicated query, the latter can be saved as a view which can
then be used to get the data. A view is created with the following command:

SQL
CREATE VIEW view_name AS complicated_query;

Remark: a view does not create any physical table and is instead seen as a shortcut.
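For instance, a minimal sketch of the table-manipulation commands above, using a hypothetical daily_sales table and Hive-style options as in the table of options (names are illustrative only):

SQL
CREATE TABLE IF NOT EXISTS daily_sales(
  sale_date DATE,
  store_id  INT,
  revenue   DOUBLE
)
stored as parquet;

INSERT INTO daily_sales
SELECT order_date, store_id, SUM(amount)
FROM orders
GROUP BY order_date, store_id;

CREATE VIEW top_stores AS
SELECT store_id, SUM(revenue) AS total_revenue
FROM daily_sales
GROUP BY store_id;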



SQL Cheatsheet (@AbzAaron)
Sources: W3Schools.com, DataQuest.io

Commands / Clauses
- Data Definition Language: CREATE, ALTER, DROP
- Data Manipulation Language: UPDATE, INSERT, DELETE, SELECT

Order of execution: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

Joins: a INNER JOIN b, a LEFT JOIN b, a RIGHT JOIN b, a FULL OUTER JOIN b

Examples covered (shown as code snippets in the original cheat sheet): select all columns with a filter applied; select the first 10 rows for two columns; select all columns with multiple filters; select all rows from col1 & col2 ordering by col1; return the count of rows in a table; return the sum of col1; return the max value for col1; compute summary stats by grouping col2; combine data from 2 tables using a left join; aggregate and filter a result; implement a CASE statement.
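To make the order of execution concrete, here is a small hedged example on a hypothetical employees table: FROM and WHERE run before GROUP BY and HAVING, so the WHERE clause cannot reference the aggregate while HAVING can, and ORDER BY and LIMIT apply last.

SQL
SELECT department, COUNT(*) AS headcount   -- 5. SELECT
FROM employees                             -- 1. FROM
WHERE hire_date >= '2020-01-01'            -- 2. WHERE (row filter)
GROUP BY department                        -- 3. GROUP BY
HAVING COUNT(*) > 10                       -- 4. HAVING (aggregate filter)
ORDER BY headcount DESC                    -- 6. ORDER BY
LIMIT 5;                                   -- 7. LIMIT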


SQL cheat sheet

Basic Queries
-- filter your columns
SELECT col1, col2, col3, ... FROM table1
-- filter the rows
WHERE col4 = 1 AND col5 = 2
-- aggregate the data
GROUP BY ...
-- limit aggregated data
HAVING count(*) > 1
-- order of the results
ORDER BY col2

Useful keywords for SELECTs:
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among given

Data Modification
-- update specific data with the WHERE clause
UPDATE table1 SET col1 = 1 WHERE col2 = 2
-- insert values manually
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME)
VALUES (1, 'Rebel', 'Labs');
-- or by using the results of a query
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME)
SELECT id, last_name, first_name FROM table2

Views
A VIEW is a virtual table, which is the result of a query. Views can be used to create virtual tables of complex queries.
CREATE VIEW view1 AS
SELECT col1, col2
FROM table1
WHERE ...

The Joy of JOINs
LEFT OUTER JOIN - all rows from table A, even if they do not exist in table B
INNER JOIN - fetch the results that exist in both tables
RIGHT OUTER JOIN - all rows from table B, even if they do not exist in table A

Updates on JOINed Queries
You can use JOINs in your UPDATEs:
UPDATE t1 SET a = 1
FROM table1 t1 JOIN table2 t2 ON t1.id = t2.t1_id
WHERE t1.col1 = 0 AND t2.col2 IS NULL;
NB! Use database-specific syntax, it might be faster!

Semi JOINs
You can use subqueries instead of JOINs:
SELECT col1, col2 FROM table1 WHERE id IN
(SELECT t1_id FROM table2 WHERE date > CURRENT_TIMESTAMP)

Useful Utility Functions
-- convert strings to dates:
TO_DATE (Oracle, PostgreSQL), STR_TO_DATE (MySQL)
-- return the first non-NULL argument:
COALESCE(col1, col2, 'default value')
-- return current time:
CURRENT_TIMESTAMP
-- compute set operations on two result sets
SELECT col1, col2 FROM table1
UNION / EXCEPT / INTERSECT
SELECT col3, col4 FROM table2;
Union - returns data from both queries
Except - rows from the first query that are not present in the second query
Intersect - rows that are returned from both queries

Indexes
If you query by a column, index it!
CREATE INDEX index1 ON table1 (col1)
Don't forget:
- Avoid overlapping indexes
- Avoid indexing on too many columns
- Indexes can speed up DELETE and UPDATE operations

Reporting
Use aggregation functions:
COUNT - return the number of rows
SUM - cumulate the values
AVG - return the average for the group
MIN / MAX - smallest / largest value
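As a final hedged sketch of the indexing and semi-join tips above, reusing the cheat sheet's hypothetical table1/table2 schema:

SQL
-- index the column used in the subquery filter
CREATE INDEX idx_table2_t1_id ON table2 (t1_id);

-- semi join: rows of table1 that have a future-dated match in table2
SELECT col1, col2
FROM table1
WHERE id IN (SELECT t1_id FROM table2 WHERE date > CURRENT_TIMESTAMP);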
7. Business Intelligence
7.1 Tableau
Tableau Basics Cheat Sheet — Tableau for Business Intelligence
Learn Tableau online at www.DataCamp.com

What is Tableau?
Tableau is a business intelligence tool that allows you to effectively report insights through easy-to-use customizable visualizations and dashboards.

Why use Tableau?
- Easy to use—no coding involved
- Integrates seamlessly with any data source
- Fast and can handle large datasets

Tableau Versions
There are two main versions of Tableau:
- Tableau Public: A free version of Tableau that lets you connect to limited data sources, create visualizations and dashboards, and publish dashboards online.
- Tableau Desktop: A paid version of Tableau which lets you connect to all types of data sources, allows you to save work locally, and supports unlimited data sizes.

Getting started with Tableau
When working with Tableau, you will work with Workbooks. Workbooks contain sheets, dashboards, and stories. Similar to Microsoft Excel, a Workbook can contain multiple sheets. A sheet can be any of the following and can be accessed on the bottom left of a workbook:
- Worksheet: A single view in a workbook. You can add shelves, cards, legends, visualizations, and more in a worksheet.
- Dashboard: A collection of multiple worksheets used to display multiple views simultaneously.
- Story: A collection of multiple dashboards and/or sheets that describe a data story.

The Anatomy of a Worksheet
When opening a worksheet, you will work with a variety of tools and interfaces.

The Sidebar: In the sidebar, you'll find useful panes for working with data:
- Data: The data pane on the left-hand side contains all of the fields in the currently selected data source.
- Analytics: The analytics pane on the left-hand side lets you add useful insights like trend lines, error bars, and other useful summaries to visualizations.

The Canvas: The canvas is where you'll create data visualizations.
1. Tableau Canvas: The canvas takes up most of the screen on Tableau and is where you can add visualizations.
2. Rows and columns: Rows and columns dictate how the data is displayed in the canvas. When dimensions are placed, they create headers for the rows or columns, while measures add quantitative values.
3. Marks card: The marks card allows users to add visual details such as color, size, labels, etc. to rows and columns. This is done by dragging fields from the data pane into the marks card.

Tableau Data Definitions
When working with data in Tableau, there are multiple definitions to be mindful of:
1. Fields: Fields are all of the different columns or values in a data source or that are calculated in the workbook. They show up in the data pane and can either be dimension or measure fields.
2. Dimensions: A dimension is a type of field that contains qualitative values (e.g. locations, names, and departments). Dimensions dictate the amount of granularity in visualizations and help reveal nuanced details in the data.
3. Measures: A measure is a type of field that contains quantitative values (e.g. revenue, costs, and market sizes). When dragged into a view, this data is aggregated, which is determined by the dimensions in the view.
4. Data types: Every field has a data type which is determined by the type of information it contains. The available data types in Tableau include text, date values, date & time values, numerical values, boolean values, geographical values, and cluster groups.

Visualizing Your First Dataset
Upload a dataset to Tableau:
1. Launch Tableau.
2. In the Connect section, under To a File, press on the file format of your choice (for an Excel file, select .xls or .xlsx).

Creating your first visualization:
1. Once your file is uploaded, open a Worksheet and click on the Data pane on the left-hand side.
2. Drag and drop at least one field into the Columns section, and one field into the Rows section at the top of the canvas.
3. To add more detail, drag and drop a dimension into the Marks card (e.g. drag a dimension over the color square in the marks card to color visualization components by that dimension).
4. To add a summary insight like a trendline, click on the Analytics pane and drag the trend line into your visualization.
5. You can change the type of visualization for your data by clicking on the Show Me button on the top right.

Data Visualizations in Tableau
Tableau provides a wide range of data visualizations. Here is a list of the most useful visualizations you have in Tableau:
- Bar Charts: Horizontal bars used for comparing specific values across categories (e.g. sales by region).
- Stacked Bar Chart: Used to show categorical data within a bar chart (e.g. sales by region and department).
- Side-by-Side Bar Chart: Used to compare values across categories in a bar chart format (e.g. sales by region comparing product types).
- Line Charts: Used for looking at a numeric value over time (e.g. revenue over time).
- Scatter Plot: Used to identify patterns between two continuous variables (e.g. profit vs. sales volume).
- Histogram: Used to show a distribution of data (e.g. distribution of monthly revenue).
- Box-and-Whisker Plot: Used to compare distributions between categorical variables (e.g. distribution of revenue by region).
- Heat Map: Used to visualize data in rows and columns as colors (e.g. revenue by marketing channel).
- Highlight Table: Used to show data values with conditional color formatting (e.g. site traffic by marketing channel and year).
- Symbol Map: Used to show geographical data (e.g. market size opportunity by state).
- Map: Used to show geographical data with color formatting (e.g. Covid cases by state).
- Treemap: Used to show hierarchical data (e.g. how much revenue subdivisions generate relative to the whole department within an organization).
- Dual Combination: Used to show two visualizations within the same visualization (e.g. profit for a store each month as a bar chart with inventory over time as a line chart).

Creating dashboards with Tableau
Dashboards are an excellent way to consolidate visualizations and present data to a variety of stakeholders. Here is a step-by-step process you can follow to create a dashboard:
1. Launch Tableau.
2. In the Connect section under To A File, press on your desired file type.
3. Select your file.
4. Click the New Sheet button at the bottom to create a new sheet.
5. Create a visualization in the sheet by following the steps in the previous sections of this cheat sheet.
6. Repeat steps 4 and 5 until you have created all the visualizations you want to include in your dashboard.
7. Click the New Dashboard button at the bottom of the screen.
8. On the left-hand side, you will see all your created sheets. Drag sheets into the dashboard.
9. Adjust the layout of your sheets by dragging and dropping your visualizations.

Creating stories with Tableau
A story is a collection of multiple dashboards and/or sheets that describe a data story.
1. Click the New Story button at the bottom of the screen.
2. Change the size of the story to the desired size in the bottom left-hand corner of the screen under Size.
3. Edit the title of the story by renaming it. To do this, right-click on the story sheet at the bottom and press Rename.
4. A story is made of story points, which let you cycle through different visualizations and dashboards. To begin adding to the story, add a story point from the left-hand side; you can add a blank story point.
5. To add summary text to the story, click Add a caption and summarize the story point.
6. Add as many story points as you would like to finalize your data story.

Customizing Visualizations with Tableau
Tableau provides a deep ability to filter, format, aggregate, customize, and highlight specific parts of your data visualizations.

Filtering data with highlights:
1. Once you've created a visual, click and drag your mouse over the specific portion you want to highlight.
2. Once you let go, you will have the option to Keep Only or Exclude the data.
3. Open the Data pane on the sidebar. Then, you can drag-and-drop a field into the Filters card just to the left of the pane.

Filtering data with filters:
1. Open the Data pane on the left-hand side.
2. Drag-and-drop a field you want to filter on and add it to the Filters card.
3. Fill out in the modal how you would like your visuals to be filtered on the data.

Aggregating data
When data is dragged into the Rows and Columns on a sheet, it is aggregated based on the dimensions in the sheet. This is typically a summed value. The default aggregation can be changed using the steps below:
1. Right-click on a measure field in the Data pane.
2. Go down to Default properties, Aggregation, and select the aggregation you would like to use.

Changing colors
Color is a critical component of visualizations: it draws attention to details, and attention is the most important component of strong storytelling. Colors in a graph can be set using the marks card:
1. Drag dimensions into the Marks field, specifically into the Color square.
2. To change from the default colors, go to the upper-right corner of the color legend and select Edit Colors. This will bring up a dialog that allows you to select a different palette.

Changing fonts
Fonts can help with the aesthetic of the visualization or help with consistent branding. To change the workbook's font, use the following steps:
1. In the Format menu on the top ribbon, press Select Workbook. This will replace the Data pane and allow you to make formatting decisions for the Workbook.
2. From here, select the font, font size, and color.
Tableau Desktop Cheat Sheet — Data Sources, Shortcuts & Terminologies

What is Tableau?
A powerful data visualization and business intelligence tool with a strong and intuitive interface. No coding knowledge or experience is needed to work with Tableau. Tableau allows users to connect to data, visualize data, and create interactive and sharable dashboards.

Design Flow
Connect to Data Source → Build Data Views → Enhance Data Views → Worksheets → Create and Organize Dashboards → Story Telling

Data Sources
- File Systems: CSV, Excel, etc.
- Relational Systems: Oracle, SQL Server, DB2, etc.
- Cloud Systems: Windows Azure, Google BigQuery, etc.
- Other Sources: ODBC

Filters
Type                            How it works
Filter Dimensions               Applied on the dimension fields.
Filter Measures                 Applied on the measure fields.
Filter Dates                    Applied on the date fields.
Single Value (List)             Select one value at a time in a list.
Single Value (Dropdown)         Select a single value in a drop-down list.
Single Value (Slider)           Drag a horizontal slider to select a single value.
Multiple Values (List)          Select one or more values in a list.
Multiple Values (Dropdown)      Select one or more values in a drop-down list.
Multiple Values (Custom List)   Search and select one or more values.
Wildcard Match                  Select values containing the specified characters.

Data Extract
- Extraction of data is done by following Menu → Data → Extract Data.
- Apply Extract Filters to create a subset of data.
- To add more data to an already created extract: Data → Extract → Append Data from File.
- Extract history: Menu → Data → Extract History.

Data Joining and Data Blending
- Data joining: Creating a Join, Editing a Join Type, Editing Join Fields.
- Data blending: Preparing Data for Blending, Adding a Secondary Data Source, Blending the Data.

Operators
General Operators, Relational Operators, Arithmetic Operators, Logical Operators.

LOD Expressions
Fixed LOD, Include LOD, and Exclude LOD.

Sorting
- Computed Sorting: Directly applied on an axis using the sort dialog button.
- Manual Sorting: Rearrange the order of dimension fields by dragging them next to each other.

Tableau Charts
Type                        Description
Text Table (Crosstab)       To see your data in rows and columns.
Heat Map                    Just like a crosstab, but it uses size and color as visual cues to describe the data.
Highlight Table             Just like an Excel table, but the cells here are colored.
Symbol Map                  Visualize and highlight geographical data.
Filled Map                  Color-filled geographical data visualization.
Pie Chart                   Represents data as slices of a circle with different sizes and colors.
Horizontal Bar Chart        Represents data in horizontal bars, visually digestible.
Stacked Bar Chart           Visualize data of a category having sub-categories.
Side-by-Side Bar Chart      Side-by-side comparison of data, vertical representation.
Treemap                     Similar to a heat map, but the boxes are grouped by items that are close in hierarchy.
Circle View                 Shows the different values that are within the categories.
Side-by-Side Circle View    Combination of Circle View and Side-by-Side Bar Chart.
Line Chart (Continuous)     Several lines in the view to show a continuous flow of data; must have a date.
Line Chart (Discrete)       Allows slicing and dicing of the graph; the graph is not continuous.
Dual Line Chart             Comparing two measures over a period.
Scatter Plot                Shows many points scattered in the Cartesian plane.
Histogram                   Represents the frequencies of values of a variable bucketed into ranges.
Gantt Chart                 Illustrates a project schedule.
Bullet Graph                Two bars drawn upon one another to indicate their individual values at the same position in the graph.
Waterfall Chart             Shows where a value starts, ends, and how it gets there incrementally.

File Shortcuts
Shortcut      Operation
ALT+F4        Closes the current workbook
ALT+F+E+I     Export to image
ALT+F+E+P     Export to packaged workbook
CTRL+N        New workbook
CTRL+O        Open file
CTRL+P        Print
CTRL+S        Save file

Data Shortcuts
Shortcut      Operation
ALT+D+A       Automatic updates
ALT+D+D       Connect to data
CTRL+D        Connect to data source
ALT+D+C+D     Duplicate data connection
ALT+D+A       Extract
ALT+D+C+P     Properties of data connection
ALT+D+R       Refresh data
F5            Refreshes the data source
F9            Run query
ALT+D+U       Toggle use extract
F10           Toggles automatic updates on and off
ALT+D+P+D     Update dashboard
ALT+D+P+Q     Update quick filters
ALT+D+P+W     Update worksheet

Analysis Shortcuts
Shortcut      Operation
ALT+A+A       Aggregate measures
ALT+A+C       Create calculated field
ALT+A+B       Describe trend model
ALT+A+U       Edit calculated field
ALT+A+L       Edit trend lines
ALT+A+F       Filter
CTRL+1        Show me!
ALT+A+S       Sort
ALT+A+M+F     Stack marks off
ALT+A+M+O     Stack marks on

Tableau Terminologies
- Alias: Refers to a field or to a dimension member.
- Bin: User-defined grouping of measures in the data source.
- Bookmark: A .tbm file in the Bookmarks folder of the Tableau repository that contains a single worksheet.
- Calculated Field: A new field created by using a formula to modify the existing fields in the data source.
- Crosstab: A text table view to display the numbers associated with dimension members.
- Dashboard: Use dashboards to compare and monitor a variety of data simultaneously.
- Data Pane: Displays the fields (divided into dimensions and measures) of the data sources to which Tableau is connected.
- Data Source Page: A page to set up your data source, consisting of a left pane, join area, preview area, and metadata area.
- Dimension: A categorical data field that holds discrete data such as hierarchies and members that cannot be aggregated.
- Extract: A saved subset of a data source that can be used to improve performance and analyze offline.
- Filters Shelf: Used to exclude data from a view by filtering it using measures and dimensions.
- Format Pane: Contains formatting settings that control the entire worksheet, as well as individual fields in the view.
- Level of Detail (LOD) Expression: A syntax that supports aggregation at dimensionalities other than the view level.
- Marks: A part of the view that visually represents one or more rows in a data source. A mark can be, for example, a bar, line, or square. You can control the type, color, and size of marks.
- Marks Card: A card to the left of the view, where you can drag fields to control mark properties such as type, color, size, shape, label, tooltip, and detail.
- Pages Shelf: Used to split a view into a sequence of pages based on the members and values in a discrete or continuous field.
- Rows Shelf: Used to create the rows of a data table; accepts any number of dimensions and measures.
- Worksheet: A sheet to build views of the data.
- Workbook: Contains one or more worksheets.
7.2 Power BI
Power BI Cheat Sheet — Power BI for Business Intelligence
Learn Power BI online at www.DataCamp.com

What is Power BI?
Power BI is a business intelligence tool that allows you to effectively report insights through easy-to-use customizable visualizations and dashboards.

Why use Power BI?
- Easy to use—no coding involved
- Integrates seamlessly with any data source
- Fast and can handle large datasets

Power BI Components
There are three components to Power BI, each of them serving different purposes:
- Power BI Desktop: Free desktop application that provides data analysis and creation tools.
- Power BI Service: Cloud-based version of Power BI with report editing and publishing features.
- Power BI Mobile: A mobile app of Power BI, which allows you to author, view, and share reports on the go.

Getting started with Power BI
There are three main views in Power BI:
- Report view: The default view, where you can visualize data and create reports.
- Data view: Lets you examine datasets associated with your reports.
- Model view: Helps you establish different relationships between datasets.

Upload datasets into Power BI
1. Underneath the Home tab, click on Get Data.
2. Choose any of your datasets and double click.
3. Click on Load if no prior data processing is needed.
4. If you need to transform the data, click Transform, which will launch Power Query. Keep reading this cheat sheet for how to apply transformations in Power Query.
5. Inspect your data by clicking on the Data View.

Create relationships in Power BI
If you have different datasets you want to connect, first upload them into Power BI, then:
1. Click on the Model View in the left-hand pane.
2. Connect key columns from different datasets by dragging one to another (e.g. EmployeeID from an Employee Database table to SalesPersonID in a Sales Performance table).

Create your first visualization
1. Click on the Report View and go to the Visualizations pane on the right-hand side.
2. Select the type of visualization you would like to plot your data on. Keep reading this cheat sheet to learn the different visualizations available in Power BI.
3. Under the Field pane on the right-hand side, drag the variables of your choice into Values or Axis.
   - Values let you visualize aggregate measures (e.g. Total Revenue).
   - Axis lets you visualize categories (e.g. Sales Person).

Aggregating data
Power BI sums numerical fields when visualizing them under Values. However, you can choose a different aggregation:
1. Select the visualization you just created.
2. Go to the Visualizations section on the right-hand side.
3. Go to Values—the visualized column should be there.
4. On the selected column, click on the dropdown arrow and change the aggregation (i.e. AVERAGE, MAX, COUNT, etc.).

Data Visualizations in Power BI
Power BI provides a wide range of data visualizations. Here is a list of the most useful visualizations you have in Power BI:
- Bar Charts: Horizontal bars used for comparing specific values across categories (e.g. sales by region).
- Column Charts: Vertical columns for comparing specific values across categories.
- Line Charts: Used for looking at a numeric value over time (e.g. revenue over time).
- Area Chart: Based on the line chart, with the difference that the area between the axis and line is filled in (e.g. sales by month).
- Scatter: Displays one set of numerical data along the horizontal axis and another set along the vertical axis (e.g. relation of age and loan).
- Combo Chart: Combines a column chart and a line chart (e.g. actual sales performance vs target).
- Treemaps: Used to visualize categories with colored rectangles, sized with respect to their value (e.g. product category based on sales).
- Pie Chart: Circle divided into slices representing a category's proportion of the whole (e.g. market share).
- Donut Chart: Similar to pie charts; used to show the proportion of sectors to a whole (e.g. market share).
- Maps: Used to map categorical and quantitative information to spatial locations (e.g. sales per state).
- Cards: Used for displaying a single fact or single data point (e.g. total sales).
- Table: Grid used to display data in a logical series of rows and columns (e.g. all products with sold items).

Power Query Editor in Power BI
Power Query is Microsoft's data transformation and data preparation engine. It is part of Power BI Desktop, and lets you connect to one or many data sources, shape and transform data to meet your needs, and load it into Power BI.

Open the Power Query Editor:
- While loading data: underneath the Home tab, click on Get Data, choose any of your datasets and double click, then click on Transform Data.
- When data is already loaded: go to the Data View and, under Queries in the Home tab of the ribbon, click on the Transform Data drop-down, then on the Transform Data button.

Removing rows
You can remove rows depending on their location and properties:
1. Click on the Home tab in the Query ribbon.
2. Click on Remove Rows in the Reduce Rows group.
3. Choose which option to use, whether Remove Top Rows, Remove Bottom Rows, etc.
4. Choose the number of rows to remove.
You can undo your action by removing it from the Applied Steps list on the right-hand side.

Adding a new column
You can create new columns based on existing or new data:
1. Click on the Add Column tab in the Query ribbon.
2. Click on Custom Column in the General group.
3. Name your new column by using the New Column Name option.
4. Define the new column formula under the custom column formula using the available data.

Replace values
You can replace one value with another value wherever that value is found in a column:
1. In the Power Query Editor, select the cell or column you want to replace.
2. Click on the column or value, and click on Replace Values under the Home tab under the Transform group.
3. Fill the Value to Find and Replace With fields to complete your operation.

Appending datasets
You can append one dataset to another:
1. Click on Append Queries under the Home tab under the Combine group.
2. Select to append either Two tables or Three or more tables.
3. Add tables to append under the provided section in the same window.

Merge Queries
You can merge tables based on a related column:
1. Click on Merge Queries under the Home tab under the Combine group.
2. Select the first table and the second table you would like to merge.
3. Select the columns you would like to join the tables on by clicking on the column from the first dataset, and from the second dataset.
4. Select the Join Kind that suits your operation: Left outer, Right outer, Full outer, Inner, Left anti, or Right anti.
5. Click on OK—new columns will be added to your current table.

Data profiling
Data Profiling is a feature in Power Query that provides intuitive information about your data:
1. Click on the View tab in the Query ribbon.
2. In the Data Preview group, tick the options you want to visualize:
   - Tick Column Quality to see the amount of missing data.
   - Tick Column Distribution to see the statistical distribution under every column.
   - Tick Column Profile to see summary statistics and more detailed frequency information of columns.

DAX Expressions
Data Analysis Expressions (DAX) is a calculation language used in Power BI that lets you create calculations and perform data analysis. It is used to create calculated columns, measures, and custom tables. DAX functions are predefined formulas that perform calculations on specific values called arguments.

Sample data
Throughout this section, we'll use the columns listed in this sample table of `sales_data`:

deal_size    sales_person          date          customer_name
1,000        Maria Shuttleworth    30-03-2022    Acme Inc.
3,000        Nuno Rocha            29-03-2022    Spotflix
2,300        Terence Mickey        13-04-2022    DataChamp

Simple aggregation
- SUM(<column>): adds all the numbers in a column.
- AVERAGE(<column>): returns the average (arithmetic mean) of all numbers in a column.
- MEDIAN(<column>): returns the median of numbers in a column.
- MIN(<column>) / MAX(<column>): returns the smallest / biggest value in a column.
- COUNT(<column>): counts the number of cells in a column that contain non-blank values.
- DISTINCTCOUNT(<column>): counts the number of distinct values in a column.

EXAMPLES
Sum of all deals — SUM('sales_data'[deal_size])
Average deal size — AVERAGE('sales_data'[deal_size])
Distinct number of customers — DISTINCTCOUNT('sales_data'[customer_name])

Logical function
- IF(<logical_test>, <value_if_true>[, <value_if_false>]): checks the result of an expression and creates conditional results.

EXAMPLE
Create a column called large_deal that returns "Yes" if deal_size is bigger than 2,000 and "No" otherwise:
large_deal = IF('sales_data'[deal_size] > 2000, "Yes", "No")

Text functions
- LEFT(<text>, <num_chars>): returns the specified number of characters from the start of a text.
- LOWER(<text>): converts a text string to all lowercase letters.
- UPPER(<text>): converts a text string to all uppercase letters.
- REPLACE(<old_text>, <start_num>, <num_chars>, <new_text>): replaces part of a text string with a different text string.

EXAMPLE
Change column customer_name to be only lower case:
customer_name = LOWER('sales_data'[customer_name])

Date and time functions
- CALENDAR(<start_date>, <end_date>): generates a column of continuous sets of dates.
- DATE(<year>, <month>, <day>): returns the specified date in the datetime format.
- WEEKDAY(<date>, <return_type>): returns 1-7 corresponding to the day of the week of a date (return_type indicates week start and end; 1: Sunday-Saturday, 2: Monday-Sunday).

EXAMPLE
Return the day of week of each deal:
week_day = WEEKDAY('sales_data'[date], 2)
POWER BI CHEATSHEET — OVERVIEW

What is Power BI?
"It is Microsoft's Self-Service Business Intelligence tool for processing and analyzing data."

Components
- Power BI Desktop—Desktop application.
  - Report—Multi-page canvas visible to end users. It serves for the placement of visuals, buttons, images, slicers, etc.
  - Data—Preview pane for data loaded into a model.
  - Model—Editable scheme of relationships between tables in a model. Pages can be used in a model for easier navigation.
  - Power Query—A tool for connecting, transforming, and combining data.
- Power BI Service—A cloud service enabling access to, and sharing and administration of, output data.
  - Workspace—There are three types of workspaces: Personal, Team, and Develop a template app. They serve as storage and enable controlled access to output data.
  - Dashboard—A space consisting of tiles in which visuals and report pages are stored.*
  - Report—A report of pages containing visuals.*
  - Worksheet—A published Excel worksheet. Can be used as a tile on a dashboard.
  - Dataset—A published sequence for fetching and transforming data from Power BI Desktop.
  - Dataflow—Online Power Query representing a special dataset outside of Power BI Desktop.*
  - Application—A single location combining one or more reports or dashboards.*
  - Admin portal—Administration portal that lets you configure capacities, permissions, and capabilities for individual users and workspaces.
  - Data Gateway—On-premises data gateway that lets you transport data from an internal network or a custom device to the Power BI Service.
  *Can be created and edited in the Power BI Service environment.
- Power BI Mobile—Mobile app for viewing reports. The mobile view is applied if it exists, otherwise the desktop view is used.
- Report Server—On-premises version of the Power BI Service. "Apart from the standard version, there is also a version for Report Server."
- Report Builder—A tool for creating page reports.

Built-in and additional languages
Built-in:
- M/Query Language—Lets you transform data in Power Query.
- DAX (Data Analysis Expressions)—Lets you define custom calculated tables, columns, and measures in Power BI Desktop.
"Both languages are natively available in Power BI, which eliminates the need to install anything."
Additional:
- Python—Lets you fetch data and create visuals. Requires installation of the Python language on your computer and enabling Python scripting.
- R—Lets you fetch and transform data and create visuals. Requires installation of the R language on your computer and enabling R scripting.

Power Query
Works with data fetched from data sources using connectors. This data is then processed at the Power BI app level and stored to an in-memory database in the program background; this means that data is not processed at the source level. The basic unit in Power Query is a query, which is one sequence consisting of steps. A step is a data command that dictates what should happen to the data when it is loaded into Power BI. The basic definition of each step is based on its use:
- Connecting data—Each query begins with a function that provides data for the subsequent steps. E.g., data can be loaded from Excel, a SQL database, SharePoint, etc. Connection steps can also be used later.
- Transforming data—Steps that modify the structure of the data. These steps include features such as Pivot Column, converting columns to rows, grouping data, splitting columns, removing columns, etc. Transformation steps are necessary in order to clean data from not entirely clean data sources.
- Combining data—Data split into multiple source files needs to be combined so that it can be analyzed in bulk. Functions include merging queries and appending queries.
  - Merge queries—Merges queries based on the selected key. The primary query then contains a column which can be used to extract data from a secondary query. Supports typical join types.
  - Append query—Places the resulting data from one or more selected queries under the primary query. In this case, data is placed in columns with names that are an exact match; non-matching columns form new columns with a unique name in the primary query.
- Custom function—A query intended to apply a pre-defined sequence of steps so that the author does not need to create them repeatedly. The custom function can also accept input data (values, sheets, etc.) to be used in the sequence.
- Parameter—Values independent of datasets. These values can then be used in queries. Parameters enable the quick editing of a model because they can be changed in the Power BI Service environment.

Dataflow
The basic unit is a table or Entity consisting of columns or Fields. Just like queries in Power Query, entities in Dataflows consist of sequences of steps. The result of such steps is stored in native Azure Data Lake Gen 2. "You can connect a custom Data Lake where the data will be stored." There are three types of entities:
- Standard entity—Only works with data fetched directly from a data source or with data from non-stored entities within the same dataflow.
- Computed entity*—Uses data from another stored entity within the same dataflow.
- Linked entity*—Uses data from an entity located in another dataflow. If data in the original entity is updated, the new data is directly passed to all linked entities.
*Can only be used in a dedicated Power BI Premium workspace.

DAX
Language developed for data analysis. It enables the creation of the following objects using expressions: Measures, Calculated Columns, and Calculated Tables. Each expression starts with the = sign, followed by links to tables/columns/functions/measures and operators. The following operators are supported:
- Arithmetic { + , - , / , * , ^ }
- Comparison { = , == , > , < , >= , <= , <> }
- Text concatenation { & , && , || , IN }
- Precedence { ( , ) }
Operators and functions require that all values/columns used are of the same data type or of a type that can be freely converted, such as a date or a number.

Visualization
Visualizations or visuals let you present data in various graphical forms, from graphs to tables, maps, and values. Some visuals are linked to other services outside Power BI, such as Power Apps. In addition to basic visuals, Power BI supports creating custom visuals. Custom visuals can be added using a file import or from a free Marketplace offering certified and non-certified visuals. Certification is optional, but it verifies whether, among other things, a visual accesses external services and resources.

Drill Down
A visual that supports the embedding of hierarchies enables drilling down to the embedded hierarchy's individual levels: drill up to a higher-level hierarchy, drill down to a specific field, drill down to the next level in the hierarchy, or expand the next-level hierarchy.

Tooltip / Custom Tooltip
- Tooltip—A default detail preview pane which appears above a visual when you hover over its values.
- Custom Tooltip—A custom-designed report page identified as descriptive. When you hover over a visual, a page appears with content filtered based on criteria specified by the value in the visual.

Drill-through
Drill-through lets you pass from a data-overview visual to a page with specific details. The target page is displayed with all the applied filters affecting the value from which the drill-through originated.

Themes
A single location for configuring all native graphical settings for visuals and pages. By default, you can choose from 19 predefined themes; custom themes can be added. A custom theme can be applied in two different ways:
- Modification of an existing theme—A native window that lets you modify a theme directly in the Power BI environment. The advantage of this approach is that you can customize any single visual.
- Importing a JSON file—Any file you create only defines the formatting that should change; everything else remains the same. "The resulting theme can be exported in the JSON format and used in any report without the need to create a theme from scratch."

Bookmarks
Bookmarks capture the currently configured view of a report page or visual. Later, you can go back to that state by selecting the saved bookmark. Setting options:
- Data—Stores filters and the applied sort order in visuals and slicers. By selecting the bookmark, you can re-apply the corresponding settings.
- Display—Stores the state of the display for visuals and report elements (buttons, images, etc.). By selecting the bookmark, you can go back to the previously stored state of the display.
- Current page—Stores the currently displayed page. By selecting the bookmark, you can go back to the stored page.

License
Per-user licenses:
- Free—Can be obtained for any Microsoft work or school email account. Intended for personal use. Users with this license can only use the personal workspace; they cannot share or consume shared content (if it is not available in a Premium workspace).
- Pro—Associated with a work/school account, priced at €8.40 per month, or included in the E5 license. Intended for team collaboration. Lets users access team workspaces, consume shared content, and use apps.
- Premium per User—Includes all Power BI Pro license capabilities, and adds features such as paginated reports, AI, a greater refresh frequency, the XMLA endpoint, and other capabilities that are only available to Premium subscribers.
Per-tenant licenses:
- Premium—Premium is set up for individual workspaces; 0 to N workspaces can be used with a single version of this license. It provides dedicated server computing power based on license type: P1, P2, P3, P4*, P5*. It offers more space for datasets, extended metrics for individual workspaces, managed consumption of dedicated capacity, linking of Azure AI features with datasets, and access for users with Free licenses to shared content. Prices start at €4,212.30. (*Only available upon special request; intended for models larger than 100 GB.)
- Embedded—Supports embedding dashboards and reports in custom apps.
- Report Server—Included in Premium or SQL Server Enterprise licenses.

Administration
- Usage metrics—Usage metrics let you monitor Power BI usage for your organization.
- Users—The Users tab provides a link to the Microsoft 365 admin center.
- Audit logs—The Audit logs tab provides a link to the Security & Compliance center.
- Tenant settings—Enable fine-grained control over features made available to your organization; they control which features will be enabled or disabled and for which users and groups.
- Capacity settings—The Power BI Premium tab enables you to manage any Power BI Premium and Embedded capacities.
- Embed codes—You can view the embed codes that are generated for your tenant to share reports publicly; you can also revoke or delete codes.
- Organization visuals—You can control which types of Power BI visuals users can access across the organization.
- Azure connections—You can control workspace-level storage permissions for Azure Data Lake Gen 2.
- Workspaces—You can view the workspaces that exist in your tenant on the Workspaces tab.
- Custom branding—You can customize the look of Power BI for your whole organization.
- Protection metrics—The report shows how sensitivity labels help protect your content.
- Featured content—You can manage all the content promoted in the Featured section.

External Tools
They simplify the use of Power BI and extend the capabilities offered in Power BI. These tools are mostly developed by the community. Recommended external tools: Tabular Editor, DAX Studio, ALM Toolkit, VertiPaq Analyzer.
JAK NA POWER POWER BI CHEATSHEET


DAX

What is DAX?

"Data Analysis Expressions (DAX) is a library of functions and operators that are combined to create formulas and expressions."

Introduction to DAX

› Where to find it
  › Power BI, Power Pivot for Excel, Microsoft Analysis Services
› Purpose
  › DAX was created to evaluate formulas across the data model, where the data is stored in the form of tables. Tables can be linked together through relationships, which may have a cardinality of 1:1, 1:N, or M:N and a direction that decides which table filters which. Relationships are either active or inactive. An active relationship is used automatically and participates in the calculation; an inactive one takes part only when it is activated, for example by the USERELATIONSHIP() function.

Basic concepts

› Constructs and their notation
  › Table – 'Table'
  › Column – [Column] -> 'Table'[Column]
  › Measure – [NameOfMeasure]
› Comments
  › Single-line (CTRL + ´) – // or --
  › Multi-line – /* */
› Data types
  › INTEGER, DECIMAL, CURRENCY, DATETIME, BOOLEAN, STRING, VARIANT (not implemented in Power BI), BINARY
  › DAX can combine some types as if they were the same type. For example, the DATETIME and INTEGER data types support the "+" operator, so it is possible to use them together.
    Example: DATETIME ( [Date] ) + INTEGER ( 1 ) = DATETIME ( [Date] + 1 )
› Operators
  › Arithmetic { + , - , / , * , ^ }
  › Comparative { = , == , > , < , >= , <= , <> }
  › Joining text { & }
  › Logic { && , || , IN , NOT }
  › Prioritization { ( , ) }

Calculated Columns

› They behave like any other column in the table. Instead of coming from a data source, they are created through a DAX expression evaluated against the current row, and we cannot read the values of another row directly.
› Import mode: they are evaluated and stored when the model is processed.
› DirectQuery mode: they are evaluated at runtime, which may slow down the model.
  Example: Profit = Trades[Quantity] * Trades[UnitPrice]

Measures

› They do not compute row-by-row results; they aggregate row-based values using the context that the environment passes to the calculation. Because of this, the result cannot be pre-computed; it is evaluated only at the moment the measure is called.
› A measure must always be assigned to a table to store its code, and this assignment can be changed at any time. Because the calculation does not depend on the table it lives in, it is common practice to have one separate measure table that groups all measures, further divided into display folders for clarity.
  Example of a measure: SalesVolume = SUM ( Trades[Quantity] )

Variables

› Variables in DAX calculations allow avoiding repeated recalculation of the same expression, which might look like this:
  NumberSort =
  VAR _selectedNumber = SELECTEDVALUE ( Table[Number] )
  RETURN
  IF ( _selectedNumber < 4, _selectedNumber, 5 )
› A variable is declared with the word VAR followed by its name, "=" and an expression. The first VAR opens a section in which 1 to X variables can be declared, each introduced by its own VAR. The section is closed by the word RETURN, which defines the return point of the calculation.
› Variables are local only, and they can store both a scalar value and a whole table.
› If a variable in the formula is not needed to produce the result, it is not evaluated (lazy evaluation).
› Variables are evaluated in the context in which they are defined, not in the context where they are used. One expression can contain multiple VAR / RETURN sections, each evaluated in the currently evaluated context.

Calculation contexts

› All calculations are evaluated on the basis of some context that the environment brings to the calculation (the evaluation context).
› Filter context – The following calculation computes the total for individual sales:
  Revenue =
  SUMX ( Trades, Trades[Quantity] * Trades[UnitPrice] )
  If this calculation is placed in a table visual without a Country column, the result is 5,784,491.77. With the Country column added, the grand total stays the same, but the individual rows provide a filter context that filters the input of the SUMX function. Axes in a chart behave the same way.
› The filter context can be adjusted with various functions, such as FILTER, ALL, ALLSELECTED.
› Row context – Unlike the previous one, this context does not filter the table. It is used to iterate over tables and evaluate column values. A typical example is a calculated column, which is computed from the data of the table row being evaluated; no manual creation is required because DAX creates the row context automatically. The SUMX example above also hides a row context: SUMX iterates over the table given in the first argument and evaluates the calculation row by row. Row contexts can even be nested, so for each row of one table, each row of a different table can be evaluated.

Calculate-type functions

› CALCULATE and CALCULATETABLE are functions that can programmatically set the filter context. In addition, they convert any existing row context into a filter context.
› CALCULATE and CALCULATETABLE syntax:
  CALCULATE / CALCULATETABLE ( <expression> [, <filter1> [, … ]] )
› The filter inside a CALCULATE expression is not of boolean type but of table type; nevertheless, a boolean expression can be used as an argument.
› Example of using CALCULATE in a cumulative calculation – the sum of sales for the last 12 months:
  CALCULATE (
      SUM ( Trades[Quantity] ),
      DATESINPERIOD ( DateKey[Date], MAX ( DateKey[Date] ), -1, YEAR )
  )
› Syntax sugar – the following notations are equivalent:
  [TradeVolume] ( Trades[Dealer] = 1 )
  =
  CALCULATE ( [TradeVolume], Trades[Dealer] = 1 )
  =
  CALCULATE ( [TradeVolume], FILTER ( ALL ( Trades[Dealer] ), Trades[Dealer] = 1 ) )

Conditions

› Like most languages, DAX has the IF function, defined by the syntax:
  IF ( <logical_test>, <value_if_true> [, <value_if_false>] )
  The false branch is optional. IF explicitly evaluates only the branch that is relevant for the result of the logical test.
› If both branches need to be evaluated, there is the IF.EAGER() function, whose syntax is the same as IF but which evaluates as:
  VAR _value_if_true = <value_if_true>
  VAR _value_if_false = <value_if_false>
  RETURN
  IF ( <logical_test>, _value_if_true, _value_if_false )
› IF has an alternative in IFERROR. It evaluates the expression and returns the output of the <value_if_error> branch only if the expression returns an error; otherwise it returns the value of the expression itself.
› DAX supports chaining of conditions, both with nested IFs and with the SWITCH function, which evaluates an expression against a list of values and returns one of several possible result expressions.
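A minimal sketch of SWITCH-based condition chaining (the measure name DealerSize and the thresholds are hypothetical; only the [TradeVolume] measure comes from the examples in this cheat sheet):

DealerSize =
SWITCH (
    TRUE (),                              -- evaluate the first condition that returns TRUE
    [TradeVolume] >= 1000000, "Large",
    [TradeVolume] >= 100000, "Medium",
    "Small"                               -- fallback result when no condition matches
)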
Calculation Groups

› They are very similar to calculated members from MDX. In Power BI it is not possible to create them directly in the Desktop application; an external tool such as Tabular Editor is required.
› A calculation group is a set of calculation items grouped according to their purpose. Its purpose is to prepare an expression that can be reused for different input measures, so the same expression does not have to be written multiple times. Wherever the input measure would appear, the placeholder SELECTEDMEASURE() is used instead.
  Example:
  CALCULATE ( SELECTEDMEASURE (), Trades[Dealer] = 1 )
› From a visual point of view, a calculation group looks like a table with just two columns, "Name" and "Ordinal", whose rows are the individual calculation items.
› In addition to making prepared expressions reusable, calculation items can also modify the output format of individual calculations. The "Format String Expression" section often uses the DAX function SELECTEDMEASUREFORMATSTRING(), which returns the format string associated with the measure being evaluated.
  Example:
  VAR _selectedCurrency = SELECTEDVALUE ( Trades[Currency] )
  RETURN
  SELECTEDMEASUREFORMATSTRING () & " " & _selectedCurrency
› In Power BI, calculation items can be applied as pre-prepared items to all measures, or, for example, only to the cross-section of items currently being evaluated.
› Sometimes it is necessary to enable the evaluation of calculation items only for specific measures. In that case it is possible to use the ISSELECTEDMEASURE() function, whose output is a boolean value, or the SELECTEDMEASURENAME() function, which returns the name of the currently evaluated measure as a string.
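A minimal sketch of a single calculation item inside such a group (the item itself is hypothetical; it would be created in Tabular Editor, not in Power BI Desktop, and it reuses the DateKey[Date] column shown elsewhere in this cheat sheet):

-- Calculation item "YTD": turns whatever measure is currently selected into a year-to-date total
CALCULATE (
    SELECTEDMEASURE (),
    DATESYTD ( DateKey[Date] )
)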
Hierarchy

› DAX itself has no capability to automatically convert your calculations to parent or child levels of a hierarchy. Each level must therefore have its own prepared measures, which are then displayed based on the ISINSCOPE function, which tests which level is currently being evaluated. Evaluation takes place from the bottom level to the top.
› The native data model used by DAX does not directly support parent/child hierarchies. On the other hand, DAX contains functions that can flatten such a hierarchy into separate columns (see the sketch at the end of this section):
  › PATH – Accepts two parameters: the first is the key ID column of the table, the second is the column that holds the parent ID of the row. The result looks like this: 1|2|3|4
    Syntax: PATH ( <ID_columnName>, <parent_columnName> )
  › PATHITEM – Returns the item at the specified position in a string produced by PATH. Positions are counted from left to right; the reversed view uses the PATHITEMREVERSE function.
    Syntax: PATHITEM ( <path>, <position> [, <type>] )
  › PATHLENGTH – Returns the number of parent elements up to the specified item in a given PATH result, including itself.
    Syntax: PATHLENGTH ( <path> )
  › PATHCONTAINS – Returns TRUE if the specified item exists in the given PATH result.
    Syntax: PATHCONTAINS ( <path>, <item> )

DAX Queries

› The basic building block of DAX queries is the EVALUATE statement followed by any expression whose output is a table.
  Example:
  EVALUATE
  ALL ( Trades[Dealer] )
› A DAX query can be divided into three primary sections, each with its specific purpose and its introductory keyword:
  › Definition – Always starts with the word DEFINE. This section defines local entities such as tables, columns, variables, and measures. There can be one definition section for an entire query, although a query can contain multiple EVALUATE statements.
  › Query – Always starts with the word EVALUATE. This section contains the table expression to evaluate and return as the result.
  › Result – An optional section that starts with the words ORDER BY and makes it possible to sort the result based on the supplied inputs.
  Example:
  DEFINE
  VAR _tax = 0.79
  EVALUATE
  ADDCOLUMNS (
      Trades,
      "AdjustedProfit", ( Trades[Quantity] * Trades[UnitPrice] ) * _tax
  )
  ORDER BY [AdjustedProfit]
› This type of notation is used, for example, in DAX Studio (daxstudio.org), a publicly available free tool that provides query validation, code debugging, and query performance measurement. DAX Studio can connect directly to Analysis Services, Power BI, and Power Pivot for Excel.

Recommended sources

› Marco Russo & Alberto Ferrari
› Daxpatterns.com
› dax.guide
› The Definitive Guide to DAX
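A minimal sketch of flattening a parent/child hierarchy with the functions above (the Employee table and its EmployeeID/ManagerID columns are hypothetical and not part of the original examples):

-- Calculated columns on a hypothetical Employee table
HierarchyPath = PATH ( Employee[EmployeeID], Employee[ManagerID] )     -- e.g. "1|2|3|4"
HierarchyDepth = PATHLENGTH ( Employee[HierarchyPath] )                -- e.g. 4
Level1 = PATHITEM ( Employee[HierarchyPath], 1, INTEGER )              -- top-level ancestor
Level2 = PATHITEM ( Employee[HierarchyPath], 2, INTEGER )              -- second level, and so on

The resulting level columns are what the per-level measures and the ISINSCOPE test described above would then be built on.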



POWER QUERY

What is Power Query?

"An IDE for M development"

Components

› Ribbon – A ribbon containing settings and pre-built features that Power Query itself rewrites into the M language for user convenience.
› Queries – A query is simply a named M expression. Queries can be moved into groups.
› Formula bar – Displays the currently loaded step and allows you to edit it. To see the formula bar, it has to be enabled in the ribbon menu inside the View category.
› Query settings – Settings that include the ability to edit the name and description of the query. It also contains an overview of all currently applied steps. Applied steps are the variables defined in a let expression, and they are represented by the variable names.
› Data preview – A component that displays a preview of the data in the currently selected transformation step.
› Status bar – The bar located at the bottom of the screen. It contains information about the approximate number of rows and columns and the time the data was last reviewed. In addition, there is profiling information for the columns; here it is possible to switch the profiling from 1000 rows to the entire data set.
› Parameter – A parameter stores a value that can be used in transformations. In addition to its name and the value it stores, it has other properties that provide metadata. The undeniable advantage of a parameter is that it can be changed from the Power BI Service environment without direct intervention in the data set. The syntax of a parameter is that of a regular query; the only special thing is that its metadata follows a specific format.

Data values

Each value type is associated with a literal syntax, a set of values of that type, a set of operators defined over that set of values, and an intrinsic type attributed to newly created values.

› Null – null
› Logical – true, false
› Number – 1, 2, 3, ...
› Time – #time(HH,MM,SS)
› Date – #date(yyyy,mm,dd)
› DateTime – #datetime(yyyy,mm,dd,HH,MM,SS)
› DateTimeZone – #datetimezone(yyyy,mm,dd,HH,MM,SS,9,00)
› Duration – #duration(DD,HH,MM,SS)
› Text – "text"
› Binary – #binary("link")
› Primitive – A primitive value is a single-part value, such as a number, logical, date, text, or null. A null value can be used to indicate the absence of any data.
› List – {1, 2, 3} – An ordered sequence of values. M supports endless lists. The characters "{" and "}" indicate the beginning and the end of the list.
› Record – [ A = 1, B = 2 ] – A set of fields, where a field is a name/value pair. The name is a text value that is unique within the record.
› Table – #table({columns},{{first row content},{}…}) – A set of values arranged in named columns and rows. The index of the first row of a table is 0. A table can be operated on as if it were a list of records, or as if it were a record of lists: Table[Field] (field reference syntax) returns the list of values in that field, and Table{i} (item access syntax) returns a record representing a row of the table.
› Function – (x) => x + 1 – A value that, when called with arguments, creates a new value. Functions are written by listing the function arguments in parentheses, followed by the goes-to symbol "=>" and the expression defining the function. This expression usually refers to the arguments by name. There are also functions without arguments.
› Type – type { number }, type table [ A = any, B = text ]

let expression

The let expression is used to capture the value of an intermediate calculation in a named variable. These named variables are local in scope to the let expression. The construction looks like this:

let
    name_of_variable = <expression>,
    returnVariable = <function>(name_of_variable)
in
    returnVariable

When it is evaluated, the following always applies:

› The expressions in the variable list define a new scope containing the identifiers produced by the variable list, and these identifiers are available when evaluating the expressions within the list. The expressions in the variable list can refer to each other.
› All variables must be evaluated before the let expression itself is evaluated.
› If expressions in the variables are not available, the let expression will not be evaluated.
› Errors that occur during query evaluation propagate as errors to other linked queries.

Custom function

Examples of custom function definitions:

(x, y) => Number.From(x) + Number.From(y)

(x) =>
let
    out = Number.From(x) + Number.From(Date.From(DateTime.LocalNow()))
in
    out

The input arguments of functions are of two types:

› Required – All arguments commonly written in (). Without these arguments, the function cannot be called.
› Optional – Such a parameter may or may not be passed to the function. Mark the parameter as optional by placing the word "optional" before the argument name, for example (optional x). If an optional argument is not supplied, it is still available for calculation purposes, but its value will be null. Optional arguments must come after required arguments.

Arguments can be annotated with "as <type>" to indicate the required type of the argument; the function will throw a type error if it is called with arguments of the wrong type. Functions can also have an annotated return type. The annotation is written as:

(x as number, y as text) as logical => <expression>

The return value of a function can vary: the output can be a list, a table, a single value, but also another function. This means that one function can produce another function. Such a function is written as follows:

let first = (x) => () => let out = {1..x} in out in first

When evaluating functions, the following holds (a combined sketch of these rules appears at the end of this Power Query section):

› Errors caused by evaluating the expressions in a variable list or in a function expression propagate further, either as a failure or as an "Error" value.
› The number of arguments supplied in a call must be compatible with the formal arguments of the function, otherwise an error with reason code "Expression.Error" occurs.

Syntax Sugar

› each is essentially a syntactic abbreviation for declaring an untyped function with a single formal parameter named _. The following notations are therefore semantically equivalent:

let
    Source = ...,
    addColumn = Table.AddColumn(Source, "NewName", each [field1] + 1)
in
    addColumn

let
    Source = ...,
    add1ToField1 = (_) => [field1] + 1,
    addColumn = Table.AddColumn(Source, "NewName", add1ToField1)
in
    addColumn

› The second piece of syntax sugar is that bare square brackets are syntax sugar for field access on a record named _.

Operators

There are several operators within the M language, but not every operator can be used with all types of values. Operator precedence applies, so for example "X + Y * Z" is evaluated as "X + (Y * Z)".

› Primary operators
  › (x) – Parenthesized expression
  › x[i] – Field reference. Returns a value from a record, or a list of values from a table.
  › x{i} – Item access. Returns a value from a list, or a record from a table. Placing the "?" character after the operator returns null if the index is not in the list.
  › x(…) – Function invocation
  › {1 .. 10} – Automatic list creation from 1 to 10
  › … – Not implemented
› Mathematical operators – +, -, *, /
› Comparative operators
  › > , >= – Greater than, greater than or equal to
  › < , <= – Less than, less than or equal to
  › = , <> – Is equal, is not equal. Equal returns true even for null = null.
› Logical operators
  › and – short-circuiting conjunction
  › or – short-circuiting disjunction
  › not – logical negation
› Type operators
  › as – Is compatible nullable-primitive type, or error
  › is – Test whether compatible nullable-primitive type
› Metadata – The word meta assigns metadata to a value. Example of assigning metadata to a variable x: "x meta y" or "x meta [name = x, value = 123, …]"
› Operators can be combined, for example: LastStep[Year]{[ID]} – this gets a value from another step based on a row index.

Conditions

Power Query also has an "if" expression, which, based on the supplied condition, decides whether the result will be the true-expression or the false-expression.

Syntactic form of the if expression:

if <predicate> then <true-expression> else <false-expression>

"else is required in M's conditional expression"

Examples:

if x > 2 then 1 else 0
if [Month] > [Fiscal_Month] then true else false

The if expression is the only conditional in M. If you have multiple predicates to test, you must chain them together like this:

if <predicate>
then <true-expression>
else if <predicate>
then <false-true-expression>
else <false-false-expression>

When evaluating conditions, the following applies:

› If the value produced by evaluating the if condition is not a logical value, an error with reason code "Expression.Error" is raised.
› The true-expression is evaluated only if the if condition evaluates to true; otherwise the false-expression is evaluated. The branch that is not taken is not evaluated.
› An error that occurs while evaluating the condition propagates further, either as a failure of the entire query or as an "Error" value in the record.

Functions in Power Query

Knowledge of functions is your best helper when working with a functional language such as M. Functions are called with parentheses. Functions can be divided into two categories:

› Prefabricated – Example: Date.From()
› Custom – Functions that the user prepares for the model by extending the notation with "() =>", where the arguments required for evaluating the function are placed in the parentheses. When using multiple arguments, they must be separated with a delimiter.
› shared – A keyword that loads all functions (including help and examples) and enumerators into the result set. The call is made inside an empty query using = #shared

Comments

The M language supports two kinds of comments:

› Single-line comments – created with // before the code. Shortcut: CTRL + ´
› Multi-line comments – created with /* before and */ after the code. Shortcut: ALT + SHIFT + A

The expression try … otherwise

Capturing errors is possible, for example, with the try expression. An attempt is made to evaluate the expression after the word try; if an error occurs during the evaluation, the expression after the word otherwise is used instead.

Syntax example:

try Date.From([textDate]) otherwise null

Each

Functions can be called with specific arguments. However, if a function needs to be executed for each record, an entire list, or an entire column in a table, the word each is appended to the code. As the name implies, it applies the procedure behind it to each context record. Each is never required; it simply makes it easier to define a function in-line for functions that require a function as their argument.
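A minimal sketch of the two equivalent forms of each (the tiny inline Trades table below is hypothetical and only used to make the snippet self-contained):

let
    // Small sample table: Dealer and Quantity columns
    Trades = #table(type table [Dealer = Int64.Type, Quantity = Int64.Type],
                    {{1, 5}, {1, 25}, {2, 40}}),
    // "each [Quantity] > 10" is shorthand for the explicit function "(_) => _[Quantity] > 10"
    FilteredShort = Table.SelectRows(Trades, each [Quantity] > 10),
    FilteredLong  = Table.SelectRows(Trades, (_) => _[Quantity] > 10)
in
    FilteredShort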
Recursive functions

Recursive functions need the "@" character, which refers to the function within its own calculation. A typical recursive function is the factorial, which can be written as follows:

let
    Factorial = (x) =>
        if x = 0 then 1 else x * @Factorial(x - 1),
    Result = Factorial(3)
in
    Result // = 6

Query Folding

As the name implies, it is about composing. Specifically, the steps in Power Query are composed into a single query, which is then executed against the data source. Data sources that support query folding are sources that support the concept of a query language, such as relational databases. This means that, for example, a CSV or XML flat file will definitely not be supported by query folding. With folding, the transformation does not have to take place after the data is loaded; the data can arrive already prepared. Unfortunately, not every source supports this feature.

› Foldable operations
  › Remove, rename columns
  › Row filtering
  › Grouping, summarizing, pivot and unpivot
  › Merge and extract data from queries
  › Combining queries based on the same data source
  › Adding custom columns with simple logic
› Non-foldable operations
  › Merging queries based on different data sources
  › Adding columns with an index
  › Changing the data type of a column

Keywords

and, as, each, else, error, false, if, in, is, let, meta, not, otherwise, or, section, shared, then, true, try, type, #binary, #date, #datetime, #datetimezone, #duration, #infinity, #nan, #sections, #shared, #table, #time

DEMO

› Production of a DateKey dimension goes like this:

#table(
    type table [Date=date, Day=Int64.Type, Month=Int64.Type, MonthName=text, Year=Int64.Type, Quarter=Int64.Type],
    List.Transform(
        List.Dates(start_date, Duration.Days(end_date - start_date) + 1, #duration(1, 0, 0, 0)),
        each {_, Date.Day(_), Date.Month(_), Date.MonthName(_), Date.Year(_), Date.QuarterOfYear(_)}
    )
)



7.3 Data Visualization
The Data Visualization Cheat Sheet (DataCamp – www.DataCamp.com)

How to use this cheat sheet: use it for inspiration when making your next data visualizations. For more data visualization cheat sheets, check out DataCamp's cheat sheets repository.

> Part-to-whole charts
› Pie chart – One of the most common ways to show part-to-whole data. It is also commonly used with percentages. Use cases: voting preference by age group, market share of cloud providers.
› Donut pie chart – A variant of the pie chart, the difference being it has a hole in the center for readability. Use cases: Android OS market share, monthly sales by channel.
› Heat maps – Two-dimensional charts that use color shading to represent data trends. Use cases: average monthly temperatures across the year, departments with the highest amount of attrition over time.
› Stacked column chart – Best to compare subcategories within categorical data. Can also be used to compare percentages. Use cases: quarterly sales per region, stock price comparison by industry and company.
› Treemap charts – 2D rectangles whose size is proportional to the value being measured; can be used to display hierarchically structured data. Use cases: grocery sales count with categories, total car sales by producer.

> Capture a trend
› Line chart – The most straightforward way to capture how a numeric variable is changing over time. Use cases: revenue in $ over time, energy consumption in kWh over time, Google searches over time.
› Multi-line chart – Captures multiple numeric variables over time. It can include multiple axes, allowing comparison between different units and scale ranges. Use cases: Apple vs Amazon stocks over time, Lebron vs Steph Curry searches over time, Bitcoin vs Ethereum price over time.
› Area chart – Shows how a numeric value progresses by shading the area between the line and the x-axis. Use cases: total sales over time, active users over time.
› Stacked area chart – The most commonly used variation of area charts; the best use is to track the breakdown of a numeric value by subgroups. Use cases: active users over time by segment, total revenue over time by country.
› Spline chart – A smoothened version of a line chart. It differs in that data points are connected with smoothed curves to account for missing values, as opposed to straight lines. Use cases: electricity consumption over time, CO2 emissions over time.

> Visualize a single value
› Card – Cards are great for showing and tracking KPIs in dashboards or presentations (e.g. "$7.47M Total Sales"). Use cases: revenue to date on a sales dashboard, total sign-ups after a promotion.
› Table chart – Best used on small datasets; it displays tabular data in a table. Use cases: account executive leaderboard, registrations per webinar.
› Gauge chart – This chart is often used in executive dashboard reports to show relevant KPIs. Use cases: NPS score, revenue to target.

> Capture distributions
› Histogram – Shows the distribution of a variable. It converts numerical data into bins as columns; the x-axis shows the range and the y-axis represents the frequency. Use cases: distribution of salaries in an organization, distribution of height in one cohort.
› Box plot – Shows the distribution of a variable using 5 key summary statistics: minimum, first quartile, median, third quartile, and maximum. Use cases: gas efficiency of vehicles, time spent reading across readers.
› Violin plot – A variation of the box plot. It also shows the full distribution of the data alongside summary statistics. Use cases: time spent in restaurants across age groups, length of pill effects by dose.
› Density plot – Visualizes a distribution by using smoothing to allow smoother distributions and better capture the distribution shape of the data. Use cases: distribution of prices of hotel listings, comparing NPS scores by customer segment.

> Visualize relationships
› Bar chart – One of the easiest charts to read, which helps in quick comparison of categorical data. One axis contains categories and the other axis represents values. Use cases: volume of Google searches by region, market share in revenue by product.
› Column chart – Also known as a vertical bar chart, where the categories are placed on the x-axis. These are preferred over bar charts for short labels, date ranges, or negatives in values. Use cases: brand market share, profit analysis by region.
› Scatter plot – The most commonly used chart when observing the relationship between two variables. It is especially useful for quickly surfacing potential correlations between data points. Use cases: relationship between time-on-platform and churn, relationship between salary and years spent at a company.
› Connected scatterplot – A hybrid between a scatter plot and a line plot; the scatter dots are connected with a line. Use cases: cryptocurrency price index, visualizing timelines and events when analyzing two variables.
› Bubble chart – Often used to visualize data points with 3 dimensions, namely the x-axis, the y-axis, and the size of the bubble. It tries to show relations between data points using location and size. Use cases: AdWords analysis (CPC vs conversions vs share of total conversions), relationship between life expectancy, GDP per capita, and population size.
› Word cloud chart – A convenient visualization for showing the most prevalent words that appear in a text. Use case: top 100 words used by customers in service tickets.

> Visualize a flow
› Sankey chart – Useful for representing flows in systems; this flow can be any measurable quantity. Use cases: energy flow between countries, supply chain volumes between warehouses.
› Chord chart – Useful for presenting weighted relationships or flows between nodes, especially for highlighting the dominant or important flows. Use cases: exports between countries to showcase the biggest export partners, supply chain volumes between the largest warehouses.
› Network chart – Similar to a graph, it consists of nodes and interconnected edges; it illustrates how different items have relationships with each other. Use cases: how different airports are connected worldwide, social media friend group analysis.
Visual vocabulary (FT)

Designing with data – There are so many ways to visualise data; how do we know which one to pick? Use the categories below to decide which data relationship is most important in your story, then look at the different types of chart within the category to form some initial ideas about what might work best. This list is not meant to be exhaustive, nor a wizard, but it is a useful starting point for making informative and meaningful data visualisations.

Deviation – Emphasise variations (+/-) from a fixed reference point. Typically the reference point is zero, but it can also be a target or a long-term average. Can also be used to show sentiment (positive/neutral/negative). Example FT uses: trade surplus/deficit, climate change.
› Diverging bar – A simple standard bar chart that can handle both negative and positive magnitude values.
› Diverging stacked bar – Perfect for presenting survey results which involve sentiment (eg disagree/neutral/agree).
› Spine chart – Splits a single value into 2 contrasting components (eg Male/Female).
› Surplus/deficit filled line – The shaded area of these charts allows a balance to be shown, either against a baseline or between two series.

Correlation – Show the relationship between two or more variables. Be mindful that, unless you tell them otherwise, many readers will assume the relationships you show them to be causal (i.e. one causes the other). Example FT uses: inflation & unemployment, income & life expectancy.
› Scatterplot – The standard way to show the relationship between two continuous variables, each of which has its own axis.
› Line + column – A good way of showing the relationship between an amount (columns) and a rate (line).
› Connected scatterplot – Usually used to show how the relationship between 2 variables has changed over time.
› Bubble – Like a scatterplot, but adds additional detail by sizing the circles according to a third variable.
› XY heatmap – A good way of showing the patterns between 2 categories of data, less good at showing fine differences in amounts.

Ranking – Use where an item's position in an ordered list is more important than its absolute or relative value. Don't be afraid to highlight the points of interest. Example FT uses: wealth, deprivation, league tables, constituency election results.
› Ordered bar – Standard bar charts display the ranks of values much more easily when sorted into order.
› Ordered column – See above.
› Ordered proportional symbol – Use when there are big variations between values and/or seeing fine differences between data is not so important.
› Dot strip plot – Dots placed in order on a strip are a space-efficient method of laying out ranks across multiple categories.
› Slope – Perfect for showing how ranks have changed over time or vary between categories.
› Lollipop chart – Lollipops draw more attention to the data value than standard bar/column and can also show rank and value effectively.

Distribution – Show values in a dataset and how often they occur. The shape (or 'skew') of a distribution can be a memorable way of highlighting the lack of uniformity or equality in the data. Example FT uses: income distribution, population (age/sex) distribution.
› Histogram – The standard way to show a statistical distribution; keep the gaps between columns small to highlight the 'shape' of the data.
› Boxplot – Summarise multiple distributions by showing the median (centre) and range of the data.
› Violin plot – Similar to a box plot but more effective with complex distributions (data that cannot be summarised with a simple average).
› Population pyramid – A standard way of showing the age and sex breakdown of a population distribution; effectively, back-to-back histograms.
› Dot strip plot – Good for showing individual values in a distribution; can be a problem when too many dots have the same value.
› Dot plot – A simple way of showing the change or range (min/max) of data across multiple categories.
› Barcode plot – Like dot strip plots, good for displaying all the data in a table; they work best when highlighting individual values.
› Cumulative curve – A good way of showing how unequal a distribution is: the y axis is always cumulative frequency, the x axis is always a measure.

Change over Time – Give emphasis to changing trends. These can be short (intra-day) movements or extended series traversing decades or centuries; choosing the correct time period is important to provide suitable context for the reader. Example FT uses: share price movements, economic time series.
› Line – The standard way to show a changing time series. If data are irregular, consider markers to represent data points.
› Column – Columns work well for showing change over time, but usually best with only one series of data at a time.
› Line + column – A good way of showing the relationship over time between an amount (columns) and a rate (line).
› Stock price – Usually focused on day-to-day activity, these charts show opening/closing and hi/low points of each day.
› Slope – Good for showing changing data as long as the data can be simplified into 2 or 3 points without missing a key part of the story.
› Area chart – Use with care; these are good at showing changes to the total, but seeing change in components can be very difficult.
› Fan chart (projections) – Use to show the uncertainty in future projections; usually this grows the further forward the projection.
› Connected scatterplot – A good way of showing changing data for two variables whenever there is a relatively clear pattern of progression.
› Calendar heatmap – A great way of showing temporal patterns (daily, weekly, monthly), at the expense of showing precision in quantity.
› Priestley timeline – Great when date and duration are key elements of the story in the data.
› Circle timeline – Good for showing discrete values of varying size across multiple categories (eg earthquakes by continent).
› Seismogram – Another alternative to the circle timeline for showing series where there are big variations in the data.

Part-to-whole – Show how a single entity can be broken down into its component elements. If the reader's interest is solely in the size of the components, consider a magnitude-type chart instead. Example FT uses: fiscal budgets, company structures, national election results.
› Stacked column – A simple way of showing part-to-whole relationships, but can be difficult to read with more than a few components.
› Proportional stacked bar – A good way of showing the size and proportion of data at the same time, as long as the data are not too complicated.
› Pie – A common way of showing part-to-whole data, but be aware that it's difficult to accurately compare the size of the segments.
› Donut – Similar to a pie chart, but the centre can be a good way of making space to include more information about the data (eg total).
› Treemap – Use for hierarchical part-to-whole relationships; can be difficult to read when there are many small segments.
› Voronoi – A way of turning points into areas; any point within each area is closer to its central point than to any other centroid.
› Sunburst – Another way of visualising hierarchical part-to-whole relationships. Use sparingly (if at all) for obvious reasons.
› Arc – A hemicycle, often used for visualising political results in parliaments.
› Gridplot – Good for showing % information; they work best when used on whole numbers and work well in multiple layout form.
› Venn – Generally only used for schematic representation.
› Waterfall – Can be useful for showing part-to-whole relationships where some of the components are negative.

Magnitude – Show size comparisons. These can be relative (just being able to see larger/bigger) or absolute (need to see fine differences). Usually these show a 'counted' number (for example barrels, dollars or people) rather than a calculated rate or per cent. Example FT uses: commodity production, market capitalisation.
› Column – The standard way to compare the size of things. Must always start at 0 on the axis.
› Bar – See above. Good when the data are not time series and labels have long category names.
› Paired column – As per standard column but allows for multiple series. Can become tricky to read with more than 2 series.
› Paired bar – See above.
› Proportional stacked bar – A good way of showing the size and proportion of data at the same time, as long as the data are not too complicated.
› Proportional symbol – Use when there are big variations between values and/or seeing fine differences between data is not so important.
› Isotype (pictogram) – An excellent solution in some instances; use only with whole numbers (do not slice off an arm to represent a decimal).
› Lollipop chart – Lollipop charts draw more attention to the data value than standard bar/column; does not HAVE to start at zero (but preferable).
› Radar chart – A space-efficient way of showing the value of multiple variables, but make sure they are organised in a way that makes sense to the reader.
› Parallel coordinates – An alternative to radar charts; again, the arrangement of the variables is important. Usually benefits from highlighting values.

Spatial – Used only when precise locations or geographical patterns in data are more important to the reader than anything else. Example FT uses: locator maps, population density, natural resource locations, natural disaster risk/impact, catchment areas, variation in election results.
› Basic choropleth (rate/ratio) – The standard approach for putting data on a map; should always show rates rather than totals and use a sensible base geography.
› Proportional symbol (count/magnitude) – Use for totals rather than rates; be wary that small differences in data will be hard to see.
› Flow map – For showing unambiguous movement across a map.
› Contour map – For showing areas of equal value on a map. Can use deviation colour schemes for showing +/- values.
› Equalised cartogram – Converting each unit on a map to a regular and equally sized shape; good for representing voting regions with equal value.
› Scaled cartogram (value) – Stretching and shrinking a map so that each area is sized according to a particular value.
› Dot density – Used to show the location of individual events/locations; make sure to annotate any patterns the reader should see.
› Heat map – Grid-based data values mapped with an intensity colour scale. As a choropleth map, but not snapped to an admin/political unit.

Flow – Show the reader volumes or intensity of movement between two or more states or conditions. These might be logical sequences or geographical locations. Example FT uses: movement of funds, trade, migrants, lawsuits, information; relationship graphs.
› Sankey – Shows changes in flows from one condition to at least one other; good for tracing the eventual outcome of a complex process.
› Waterfall – Designed to show the sequencing of data through a flow process, typically budgets. Can include +/- components.
› Chord – A complex but powerful diagram which can illustrate 2-way flows (and net winner) in a matrix.
› Network – Used for showing the strength and inter-connectedness of relationships of varying types.

FT graphic: Alan Smith; Chris Campbell; Ian Bott; Liz Faunce; Graham Parrish; Billy Ehrenberg; Paul McCallum; Martin Stabe. Inspired by the Graphic Continuum by Jon Schwabish and Severino Ribecca.
ft.com/vocabulary
8. Emerging Technologies
Top 10 Technology Trends
Artificial Intelligence (AI) – Simulation of human intelligence processes by machines, which includes machine, deep, and reinforcement learning, natural language processing and computer vision.

Internet of Things (IoT) – Networking capability that allows information to be sent to and received from objects and devices (things).

Robotic Process Automation (RPA) – Technology that mimics the way humans interact with software to perform high-volume, repeatable tasks.

Cloud Computing – Delivery of different services through the Internet, including data storage, servers, databases, networking, and software.

Quantum Computing – Machines that use the properties of quantum physics to store data and perform computations, which can result in a computer 158 million times faster than the most sophisticated supercomputer we have in the world today.

3D Printing – Also known as additive manufacturing; a method of creating a three-dimensional object layer by layer using a computer-created design.

5th Generation Network (5G) – 5G is the fifth generation of wireless technology, which can provide higher speed, lower latency and greater capacity than 4G LTE networks.

Extended Reality (XR) – Technology that represents all human-machine interactions, which includes augmented reality (AR), mixed reality (MR) and virtual reality (VR).

Blockchain – Technology that facilitates the process of recording transactions and tracking assets in a business network, where an asset can be tangible (house, car, land) or intangible (patents, copyrights, branding).

Cyber Security – Application of technologies, processes and controls to protect systems, networks, programs, devices and data from cyber attacks.
8.1 Artificial Intelligence (AI)
Machine Learning Algorithm Cheat Sheet
This cheat sheet helps you choose the best machine learning algorithm for your predictive analytics solution.
Your decision is driven by both the nature of your data and the goal you want to achieve with your data.

What do you want to do?

› Extract information from text (Text Analytics) – Derives high-quality information from text. Answers questions like: What info is in this text?
  › Extract N-Gram Features from Text – Creates a dictionary of n-grams from a column of free text
  › Feature Hashing – Converts text data to integer encoded features using the Vowpal Wabbit library
  › Preprocess Text – Performs cleaning operations on text, like removal of stop-words and case normalization
  › Word2Vector – Converts words to values for use in NLP tasks like recommenders, named entity recognition, and machine translation
  › Latent Dirichlet Allocation – Unsupervised topic modeling; groups texts that are similar

› Predict between several categories (Multiclass Classification) – Answers complex questions with multiple possible answers. Answers questions like: Is this A or B or C or D?
  › Multiclass Logistic Regression – Fast training times, linear model
  › Multiclass Neural Network – Accuracy, long training times
  › Multiclass Decision Forest – Accuracy, fast training times
  › Multiclass Boosted Decision Tree – Non-parametric, fast training times and scalable
  › One-vs-All Multiclass – Depends on the two-class classifier
  › One-vs-One Multiclass – Depends on the binary classifier; less sensitive to an imbalanced dataset, with larger complexity

› Predict between two categories (Two-Class Classification) – Answers simple two-choice questions, like yes or no, true or false. Answers questions like: Is this A or B?
  › Two-Class Support Vector Machine – Under 100 features, linear model
  › Two-Class Averaged Perceptron – Fast training, linear model
  › Two-Class Decision Forest – Accurate, fast training
  › Two-Class Logistic Regression – Fast training, linear model
  › Two-Class Boosted Decision Tree – Accurate, fast training, large memory footprint
  › Two-Class Neural Network – Accurate, long training times

› Generate recommendations (Recommenders) – Predicts what someone will be interested in. Answers the question: What will they be interested in?
  › Train Wide & Deep Recommender – Hybrid recommender, both collaborative filtering and content-based approach
  › SVD Recommender – Collaborative filtering; better performance with lower cost by reducing dimensionality

› Predict values (Regression) – Makes forecasts by estimating the relationship between values. Answers questions like: How much or how many?
  › Fast Forest Quantile Regression – Predicts a distribution
  › Poisson Regression – Predicts event counts
  › Linear Regression – Fast training, linear model
  › Bayesian Linear Regression – Linear model, small data sets
  › Decision Forest Regression – Accurate, fast training times
  › Neural Network Regression – Accurate, long training times
  › Boosted Decision Tree Regression – Accurate, fast training times, large memory footprint

› Discover structure (Clustering) – Separates similar data points into intuitive groups. Answers questions like: How is this organized?
  › K-Means – Unsupervised learning

› Find unusual occurrences (Anomaly Detection) – Identifies and predicts rare or unusual data points. Answers the question: Is this weird?
  › One Class SVM – Under 100 features, aggressive boundary
  › PCA-Based Anomaly Detection – Fast training times

› Classify images (Image Classification) – Classifies images with popular networks. Answers questions like: What does this image represent?
  › ResNet, DenseNet – Modern deep learning neural networks
© 2021 Microsoft Corporation. All rights reserved.
8.2 Internet of Things (IoT)
Internet of Things Cheat Sheet
The Internet of Things (IoT) refers to a system of interrelated, internet-connected objects that can collect and
transfer data over a wireless network without human intervention. IoT devices are basically "smart" devices that
make "smart" systems (for e.g., smart home, smart factory, smart farms, smart cities, etc.).

Case studies (Overview / Challenge / Solution / Results)

› If machines could talk – and tell you when they were going to break down.
  Challenge: when a machine is not functioning properly, productivity drops. How can you tell, quickly, that something is wrong?
  Solution: IoT sensors on agricultural machinery were connected to the cloud. Managers and workers received immediate notifications when the machine was not functioning properly.
  Results: this also resulted in predictive data analysis from the information provided. Engineers started to build a picture and predict when the machine would break, and plan maintenance. This increased productivity to 100%.

› Remote monitoring – improving work performance.
  Challenge: staff shortages at a civil engineering company reduced the number of onsite visits that could be carried out.
  Solution: rather than manually taking readings and producing reports, data was collected by IoT sensors, uploaded to the cloud, and a report was generated from it.
  Results: improved efficiencies, enabling the company to take on more projects and increase their customer base.

› Blinding the competition – how to get a USP with IoT.
  Challenge: consumers want smarter, simpler everyday devices.
  Solution: blinds that are IoT enabled, with features including a camera, thermometer and lighting sensors, allowing for seamless integration into the home environment.
  Results: "Work is still on-going on IoT products at Bloc Blinds as it is an exciting space with loads of potential!" Read how Magherafelt's Bloc Blinds has incorporated IoT technologies.
8.3 Robotic Process Automation (RPA)
ROBOTIC PROCESS AUTOMATION CHEAT SHEET

RPA Basics

› RPA (Robotic Process Automation) is used for automating current workflows with the help of robots, to reduce human intervention at every point.
› Robotic: machines that mimic human activities and actions are called robots.
› Process: a sequence of steps used to perform a particular task.
› Automation: any process performed by a robot without human intervention, providing a high degree of accuracy.

RPA Tool Factors

› Technology: many organizations perform tasks outside the local system using virtual machines or Citrix, so the tool must be platform independent and support any type of application.
› Scalability: how quickly and easily the tool responds to the business requirements.
› Security: the implementation of security controls must be measured.
› Total cost: the initial set-up cost, vendor license fee and maintenance cost must be taken into consideration while selecting the tool.
› Ease of use and control: the tool must be user friendly to increase efficiency and employee satisfaction.
› Vendor experience: it improves the speed of implementation by reducing the work required to implement the RPA software.
› Maintenance and support: to make sure that the required service level agreements are met.
› Quick deployment: the tool must be able to work as a real end user by interacting with applications.

Blue Prism

› Built on a simple concept of replicating user activity; SQL Server is where its data is stored.
› Architecture (diagram labels): applications with a user interface and business logic are wrapped as Virtual Business Objects (VBOs) built in Object Studio; processes are designed in Process Studio; everything is operated through the Control Room, the Blue Prism database and the system manager. Highlights: a virtual employee rather than a physical bot, documents are processed consistently at every step, no changes to the existing infrastructure are needed, the existing applications are reused, and time-to-market is within a few weeks.

Automation Anywhere

› The architecture consists of clients, bots and a control room. Automation Anywhere is software designed to virtually automate any computer process; SQL Express is where its data is stored.
› Architecture (diagram labels): Bot Creators (developers) and desktop Bot Runners connect to the Control Room, which provides user management, source control, a dashboard and license management.

UiPath

› A simple software automation and application integration expert. UiPath consists of three parts: UiPath Studio, UiPath Robots and UiPath Orchestrator.
› Orchestrator architecture (diagram labels): browser and mobile clients reach a web portal built from an information tier, a logic tier and a data tier (databases), with a company directory server and an access manager for external users.

Benefits of RPA

› Reduces the burden on IT: it does not disturb the underlying legacy systems.
› Reliability: bots can work 24*7 effectively.
› Cost-cutting technology: it reduces costs by reducing the size of the manual workforce.
› No coding required: to use RPA tools, a person need not have programming skills.
› Accuracy: it functions accurately and is less prone to errors.
› Productivity rate: execution time is much faster than the manual process approach.
› Compliance: it follows the rules and provides an audit trail.
› Consistency: repetitive tasks are performed in the same way.
› Increased employee engagement: it lets employees focus on value-added activities.

Using RPA

› Customer service: automating service order management and quality reporting.
› Travel and logistics: ticket booking, passenger details and accounting.
› Human resources: new-employee joining formalities, payroll processing and hiring shortlisted candidates.
› Health care: patient registration and billing.
› Banking and financial services: card activation, fraud claims and discovery.
› Government: change of address and license renewal.

RPA tool comparison

VENDOR               Free Version               Pricing       Usability                      Global Coverage
Another Monday       ----                       ----          ----                           Europe
AntWorks             ----                       ----          ----                           ----
Arago                ----                       ----          ----                           ----
Automation Anywhere  ----                       Per process   Drag & Drop, Macro Recording   Global
BluePrism            ----                       Per bot       Drag & Drop                    ----
Contextor            ----                       ----          ----                           EMEA & North America
Jidoka               ----                       ----          ----                           ----
Kofax                ----                       ----          ----                           Global
Kryon Systems        ----                       ----          ----                           ----
Nice systems         ----                       ----          ----                           Global
Pega                 ----                       ----          ----                           Global
Redwood software     ----                       ----          ----                           ----
UiPath               UiPath Community Edition   Per bot       Drag & Drop, Macro Recording   Global
Visual Cron          45 day free trial          Per server    Drag & Drop                    ----
WorkFusion           WorkFusion RPA Express     Per process   Drag & Drop, Macro Recording   Global
8.4 Cloud Computing
Cloud Computing – Summary Cheat Sheet

What is Cloud Computing?

"Cloud Computing is an approach to offer IT services to customers remotely."

› SaaS – suitable for end users (e.g. Gmail, Office 365)
› PaaS – suitable for developers (e.g. AWS, Google App Engine)
› IaaS – for network architects and IT admins (e.g. MS Azure, Google Compute Engine)
The lower the layer (SaaS -> PaaS -> IaaS), the more control the customer has.

Cloud types (based on location & deployment)

1. Public (external cloud)   2. Private (internal cloud)   3. Hybrid   4. Community

Cloud types (based on services)

1. IaaS   2. PaaS   3. SaaS   4. Other types: STaaS, SECaaS, DaaS, TEaaS, APIaaS, FaaS

Cloud computing models – who manages what

› Traditional IT (on-premises): the client manages the whole stack – applications, data, runtime, middleware, O/S, virtualization, servers, storage and networking.
› IaaS: the client manages applications, data, runtime, middleware and the O/S; the vendor manages virtualization, servers, storage and networking in the cloud.
› PaaS: the client manages applications and data; the vendor manages runtime, middleware, O/S, virtualization, servers, storage and networking.
› SaaS: the vendor manages the entire stack in the cloud.
Moving from on-premises toward SaaS trades customization (higher costs, slower time to value) for standardization (lower costs, faster time to value).

Advantages of Cloud Computing

↑ Cost effective (low CAPEX/OPEX)
↑ Scalable, standardized model, highly reliable
↑ Quick deployment, easy disaster recovery, flexible

Disadvantages of Cloud Computing

↓ Less secure
↓ Limited control (especially in the SaaS/PaaS models)
↓ Requires extra bandwidth
Network Walks Training Academy www.networkwalks.com
8.5 Quantum Computing
• History of Quantum computing

• Classic computing vs Quantum computing


8.6 3D Printing
• History of 3D printing Technology

• How does 3D printing work?

• Type of 3D Printing Technologies


3D Printing Options Cheat Sheet
by [deleted] via cheatography.com/2754/cs/16505/

Introduction

Our manufacturing industry generally – and 3D printing specifically – is driven by innovation. Indeed, key technological developments and new applications in industrial-grade 3D printing, or additive manufacturing, continue to advance this technology, which has only been around for a little more than 30 years.
Designers and engineers can now choose from several distinct classes of 3D printing technologies. Your choice of "tool" just depends on what it is you're designing and what its final application is. Here's a brief roundup of some of the main industrial-grade 3D printing options:

Source: https://www.medicaldesignandoutsourcing.com/3d-printing-options-medical-device-development/

Stereolithography (SL)
Stereolithography (SL) uses an ultraviolet laser that draws on the surface of a liquid thermoset resin to create thousands of thin layers until final parts are formed. SL is used to create concept models, cosmetic prototypes, and complex parts with intricate geometries.

Selective laser sintering (SLS)
Selective laser sintering (SLS) uses a CO2 laser that lightly fuses nylon-based powder, layer by layer, until final thermoplastic parts are created. SLS produces accurate prototypes and functional production parts.

Direct metal laser sintering (DMLS)
Direct metal laser sintering (DMLS) uses a fiber laser system that draws onto a surface of atomized metal powder, welding the powder into fully dense metal parts. DMLS builds fully functional metal prototypes and production parts and works well to reduce metal components in multipart assemblies.

PolyJet
PolyJet uses a jetting process in which small droplets of liquid photopolymer are sprayed from multiple jets onto a build platform and cured in layers that form elastomeric parts. PolyJet produces multi-material prototypes with flexible features at varying durometers and is often used to concept overmolding designs.

Fused Deposition Modeling (FDM)
Fused deposition modeling (FDM) works by feeding a filament or spool of plastic into a heated nozzle, which then extrudes successive layers of thermoplastics onto the workpiece. FDM offers a wide thermoplastic material selection and is leveraged for iterative prototyping.

Continuous Liquid Interface Production (CLIP)
Carbon is the name of the company that is using a process called CLIP, Continuous Liquid Interface Production, which builds parts from the top down, unlike other additive technologies that work from the bottom up. Final plastic parts exhibit excellent mechanical properties and surface finishes.

Multi Jet Fusion
The Multi Jet Fusion process selectively applies fusing and detailing agents across a bed of nylon powder, which are fused in thousands of layers by heating elements into a solid functional component. Final parts exhibit improved surface roughness, fine feature resolution, and more isotropic mechanical properties when compared to processes like SLS.

8.7 5th Generation Network (5G)
• History of 5G network

• Advantages of 5G network
8.8 Extended Reality (XR)
Immersive Tech
Immersive technologies create experiences by merging the physical world with a digital / simulated
reality. Augmented reality (AR) and virtual reality (VR) are the two main types of immersive tech. AR
blends computer-generated information onto the user’s real environment. VR uses computer-
generated information to provide a full sense of immersion.
Case studies (Challenge / Solution / Results):

VR versus Covid-19 – Continuing business remotely
• Challenge: Due to Covid-19 restrictions, a company who delivered forklift driver training was unable to provide practical instruction sessions.
• Solution: Use of virtual reality to train construction workers to use a forklift remotely. VR is also being used to deliver Health & Safety training to oil rig workers before they are deployed, and to show potential apprentices what experience they would receive without distracting workers from their activities.
• Results: Training occurred remotely using VR for the first two days, then onsite with the forklift on the 3rd. This improved productivity, as the forklift became available for work purposes for two days, when it was previously required for training.

Adding value through AR – Business growth in a post Covid-19 environment
• Challenge: An interior design consultancy was struggling with a lack of business due to Covid-19 restrictions. Designers were unable to visit their customers' homes.
• Solution: Customers used AR to take measurements, and designers were able to create and place items within their 3D world.
• Results: They were able to survive Covid-19 by bringing their business online. They can also now access a wider client base and scale their business, as their market is no longer restricted by distance.

Creating Tourist Destinations using AR – Increasing visitor numbers & revenue
• Challenge: Tourists, day trippers, and locals are all looking for new experiences.
• Solution: An interactive AR app to create a huge-scale solar system trail. There is also an app for those following the trail, giving information on the planets and the solar system.
• Results: Will reach millions, bring people together, and showcase UK creativity globally. Read more about Northern Ireland's Our Place in Space, a 3-dimensional sculpture trail with an interactive AR app.
The Digital Surge Programme is part funded by Invest Northern Ireland and the European Regional Development Fund under the Investment for Growth & Jobs Northern Ireland (2014-2020) Programme.
8.9 Blockchain
Blockchain
Blockchain is a shared, immutable ledger for recording transactions and tracking assets in a business network.
With blockchain technology, companies can establish a transparent supply chain, verify the authenticity of
product claims and gain consumer trust.
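
To make the "immutable ledger" idea concrete, here is a minimal, illustrative sketch (not the implementation used in the case studies below) of how blocks can be chained with cryptographic hashes so that tampering with any recorded transaction invalidates every later block. The `Block` class and its field names are hypothetical and chosen only for this example.

```python
import hashlib
import json
import time

class Block:
    """One entry in a toy ledger: transaction data plus the hash of the
    previous block, which is what makes the chain tamper-evident."""
    def __init__(self, index, transactions, previous_hash):
        self.index = index
        self.timestamp = time.time()
        self.transactions = transactions      # e.g. a list of dicts
        self.previous_hash = previous_hash
        self.hash = self.compute_hash()

    def compute_hash(self):
        payload = json.dumps(
            {"index": self.index, "timestamp": self.timestamp,
             "transactions": self.transactions, "previous_hash": self.previous_hash},
            sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

def is_chain_valid(chain):
    """Recompute every hash and check each block points at its predecessor."""
    for prev, curr in zip(chain, chain[1:]):
        if curr.previous_hash != prev.hash or curr.hash != curr.compute_hash():
            return False
    return True

# Build a tiny chain: a genesis block plus one block of supply-chain data.
genesis = Block(0, [], "0")
block1 = Block(1, [{"part": "valve-17", "supplier": "A", "status": "shipped"}], genesis.hash)
chain = [genesis, block1]
print(is_chain_valid(chain))          # True

# Tampering with recorded data breaks validation for every later check.
block1.transactions[0]["status"] = "lost"
print(is_chain_valid(chain))          # False
```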
Case studies (Challenge / Solution / Results):

Connecting through Blockchain – Increasing productivity and profitability
• Challenge: How to track devices through a supply chain in the medical devices sector?
• Solution: Each supplier was invited to the blockchain, added data about the parts they were making, and added status.
• Results: Able to view the origin of raw materials, have an estimated delivery of parts from 30 different suppliers, and guarantee provenance of medical-grade products. This led to a significant increase in productivity, approximately a 5-10% increase in turnover due to increased capacity, and a 70% reduction in manual report creation.

Food and Beverage Traceability – Showcasing the truth of your product
• Challenge: How to create a safer and more transparent supply chain?
• Solution: Used blockchain technology and included a QR code on food and beverage products to allow the customer to see exactly how the product was made.
• Results: Using blockchain enhanced the transparency of product information and reduced food fraud and waste. It increased food safety and profitability among food suppliers, and it improved consumer information, boosting trust & sales. See how a County Down enterprise uses blockchain to reveal everything that a consumer would want to know about their beer!
8.10 Cyber Security
Cyber Security Quick Reference Guide (CustomGuide, ref.customguide.com)
Security Risks

Businesses worldwide are at risk for security breaches. While large, well-known companies seem like a likely target, small and medium-sized organizations and individuals are also at risk. There are many ways data can be compromised, including viruses, phishing scams, hardware and software vulnerabilities, and network security holes.

Did you know? 11% of U.S. adults have had personal information stolen, 1 in 5 people have had an email or social media account hacked, 98% of software applications are vulnerable, and only 40% of adults know how to protect themselves online.

Confidential Information

When dealing with security, confidentiality means private information is never viewed by unauthorized parties. Confidential information must only be accessible to those authorized to view the sensitive data. Confidential information includes:

Personal Information
• Social Security Number
• Home Address
• Salary History
• Performance Issues
• Credit Card Numbers

Corporate Information
• Processes
• Customer Lists
• Research and Development
• Business Strategies
• Objectives and Projections

Firewalls

A firewall acts like a security guard and prevents unauthorized people or programs from accessing a network or computer from the Internet. There are hardware-based firewalls, which create a protective barrier between internal networks and the outside world, and software firewalls, which are often part of your operating system.

Passwords

The first line of defense in maintaining system security is using complex passwords. Use passwords that are at least 8 characters long and include a combination of numbers, upper and lowercase letters, and special characters. Hackers have tools that can break easy passwords in just a few minutes.

How Long Does it Take to Crack a Password?

There are 2 kinds of passwords:
• Simple: lowercase letters only
• Complex: upper and lowercase letters, numbers, and special characters

Complex passwords are EXPONENTIALLY more difficult to crack. Use them! Here's how long it takes to crack a password when it's simple vs. complex:

Characters | Simple password | Time to crack     | Complex password | Time to crack
8          | ghiouhel        | 4 hours, 7 min    | ghiouH3l         | 6 months
9          | houtheouh       | 4 days, 11 hours  | Houtheo!2        | 1060 years
10         | ghotuhilhg      | 112 days          | gh34uhilh!       | 1500 years
11         | wopthiendhf     | 8 years, 3 months | w3pthi7ndh!      | 232,800 years
12         | whithgildnzq    | 210 years         | @hi3hg5ldnq!     | 15,368,300 years

Source: mywot.com
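
As a rough back-of-the-envelope illustration of why complex passwords are exponentially harder to crack, the short sketch below computes the size of the brute-force search space for each password style and divides it by an assumed guess rate. The one-billion-guesses-per-second figure is a hypothetical assumption for illustration only; real attack speeds vary widely, so the absolute times will not match the table above, but the gap between simple and complex grows by orders of magnitude with every added character, which is the point the table is making.

```python
# Back-of-the-envelope brute-force estimate:
# keyspace = alphabet_size ** length, worst-case time = keyspace / guess rate.
GUESSES_PER_SECOND = 1e9          # hypothetical attacker speed, illustration only
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

ALPHABETS = {
    "simple (lowercase only)": 26,
    "complex (upper + lower + digits + symbols)": 26 + 26 + 10 + 32,
}

for label, alphabet_size in ALPHABETS.items():
    print(label)
    for length in range(8, 13):
        keyspace = alphabet_size ** length
        seconds = keyspace / GUESSES_PER_SECOND
        if seconds < SECONDS_PER_YEAR:
            print(f"  {length} chars: ~{seconds / 86400:,.1f} days")
        else:
            print(f"  {length} chars: ~{seconds / SECONDS_PER_YEAR:,.0f} years")
```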
Malware

Malware is short for "malicious software." It is written to infect the host computer. Common types of malware include:
• Virus: a replicating computer program that infects computers
• Adware: hijacks your computer or browser and displays annoying advertisements
• Spyware: secretly tracks your internet activities and information
• Trojan: a malicious program that tries to trick you into running it
Online Browsing

Browsers communicate to websites with a protocol called HTTP, which stands for Hyper Text Transfer Protocol. HTTPS is the secure version of HTTP. Websites that use HTTPS encrypt all communication between your browser and the site.

Secure sites have an indicator, like a padlock, in the address bar to show the site is secure. You should always ensure security when logging in or transferring confidential information. Sites without HTTPS are not secure and should never be used when dealing with personal data. If you are simply reading an article or checking the weather, HTTP is acceptable.

Network Security
• Use Wi-Fi password security and change the default password
• Set permissions for shared files
• Only connect to known, secure public Wi-Fi and ensure HTTPS-enabled sites are used for sensitive data
• Keep your operating system updated
• Perform regular security checks
• Browse smart!
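
As a small illustrative sketch of the HTTP-vs-HTTPS rule above, the snippet below checks whether a URL uses the encrypted scheme before anything sensitive is sent to it. The helper name `require_https` and the example URLs are made up for this example; it simply inspects the URL scheme with Python's standard library.

```python
from urllib.parse import urlparse

def require_https(url: str) -> bool:
    """Return True only if the URL uses HTTPS, i.e. traffic to it is encrypted."""
    return urlparse(url).scheme == "https"

for url in ["https://mybank.example.com/login", "http://news.example.com/article"]:
    if require_https(url):
        print(f"OK to send sensitive data: {url}")
    else:
        print(f"Not secure, never submit personal data here: {url}")
```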
Email and Phishing

A phishing email tries to trick consumers into providing confidential data to steal money or information. These emails appear to be from a credible source, such as a bank, government entity, or service provider. Here are some things to look for in a phishing email:
• Sender's Address: the address should be correlated with the sender
• Grammatical Errors: spelling mistakes and poor grammar
• Generic References: not being addressed by your name
• Immediate Action: beware of anything that calls for urgent action
• Hover Link: always check where links lead before clicking
• Attachments: never open an attachment you aren't expecting
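
Purely as an illustration of how a few of these red flags could be automated, here is a minimal rule-based sketch. The function name, keyword list, and domain values are all hypothetical; a real mail filter would use far more signals than this.

```python
import re

URGENT_WORDS = {"urgent", "immediately", "verify your account", "suspended"}

def phishing_flags(sender: str, subject: str, body: str, expected_domain: str) -> list:
    """Return a list of simple red flags found in an email (illustrative only)."""
    flags = []
    # Sender's address should be correlated with the claimed sender.
    if not sender.lower().endswith("@" + expected_domain):
        flags.append("sender domain does not match expected domain")
    # Anything that calls for urgent, immediate action is suspicious.
    text = (subject + " " + body).lower()
    if any(word in text for word in URGENT_WORDS):
        flags.append("pressure to act immediately")
    # Generic greeting instead of your name.
    if re.search(r"dear (customer|user|sir/madam)", text):
        flags.append("generic greeting, not addressed by name")
    return flags

print(phishing_flags(
    sender="support@secure-bank-login.example",
    subject="URGENT: verify your account",
    body="Dear customer, your account is suspended. Click the link immediately.",
    expected_domain="mybank.example"))
```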