6 - Python Crash Course - 15oct20
Content
• Introduction to Python:
– https://www.python.org/about/gettingstarted/
• Introduction to Numpy: The fundamental package for scientific computing with Python
– https://numpy.org/devdocs/user/quickstart.html
• Introduction to Scikit-learn: Python library for machine learning. Exercises on Naive Bayes and Linear Regression
– https://scikit-learn.org/stable/getting_started.html
Introduction to Python
Installation
Development
Libraries
• they can be managed through conda or through Python's own package manager, pip:
pip install numpy
import numpy as np
np.random.random((2,3))
• if statements
• for loops
• other useful statements (break, continue, pass, …) and functions (range, enumerate, zip), illustrated below
a = 5
b = None
if a > 5:
    print("a>5")
elif a == 5:
    print("a=5")
else:
    print("a < 5")

for a in range(5):
    print(a)
a=5
0
1
2
3
4
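A few quick examples of the other statements and functions listed above (a minimal sketch):

# enumerate yields (index, element) pairs
for i, c in enumerate("abc"):
    print(i, c)

# zip iterates over several sequences in parallel
for n, name in zip([1, 2, 3], ["one", "two", "three"]):
    print(n, name)

# break exits the loop early; continue skips to the next iteration
for n in range(10):
    if n == 2:
        break
    print(n)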
• dynamic typing
x = 3
#type(x)
x = 'Hello'
type(x)
if isinstance(x, str):
    print("This is a string")
else:
    print(type(x))
This is a string
# calling an undefined method raises an AttributeError (b is None)
b.pretty_print()
AttributeError: 'NoneType' object has no attribute 'pretty_print'

string = "Hello World"
print(string)
Hello World

# strings are immutable
string[5] = "_"
TypeError: 'str' object does not support item assignment
# predefined methods
splitted = string.split(" ")
'_'.join(splitted)
'Hello_World'
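We can also define our own functions. A minimal sketch of the euclidean_distance function used below, consistent with the printed result:

import math

# distance between points (x1, y1) and (x2, y2)
def euclidean_distance(x1, y1, x2, y2):
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)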
print(euclidean_distance(0,0,1,1))
1.4142135623730951
Introduction to Numpy
a = np.array([[1, 2, 3], [4, 5, 6]])
print("a =")
print(a)
print("Shape:", a.shape)
a =
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)
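Besides np.array, numpy provides several convenient array constructors, for example:

print(np.zeros((2, 2)))            # 2x2 array of zeros
print(np.ones(3))                  # vector of three ones
print(np.arange(6).reshape(2, 3))  # 0..5 reshaped into a 2x3 matrix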
• useful tools for working with arrays (e.g. element-wise multiplication with plain Python lists vs numpy arrays)
# lists: element-wise multiplication needs explicit loops
a = b = [[1,2],[3,4]]
c = [[0,0],[0,0]]
for i in range(len(a)):
    for j in range(len(a[i])):
        c[i][j] = a[i][j] * b[i][j]

# numpy arrays: the same operation is a single expression
a = b = np.array([[1,2],[3,4]])
c = a * b
print(a)
#a[:,1]   # second column
#a[1,:]   # second row
[[1 2]
 [3 4]]
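Indexing also works with boolean masks, which we will use again in the Naive Bayes exercise:

mask = a > 2       # element-wise comparison gives a boolean array
print(mask)
print(a[mask])     # only the elements where mask is True: [3 4]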
• operations:
– element‑wise operations
– broadcasting (operations between arrays of different shapes)
print(a+b)
print(a-b)
print(a*b)
[[2 4]
 [6 8]]
[[0 0]
 [0 0]]
[[ 1  4]
 [ 9 16]]
print(a+1)
print(a*2)
[[2 3]
 [4 5]]
[[2 4]
 [6 8]]
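Broadcasting also works between arrays of different shapes, as long as their dimensions are compatible; for example, a row vector is stretched across every row of the matrix:

row = np.array([10, 20])
print(a + row)   # row is added to each row of a
[[11 22]
 [13 24]]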
import matplotlib.pyplot as plt
import random

cat = plt.imread('cat.jpg')
print("Type", type(cat), "Shape", cat.shape)
plt.imshow(cat)
h, w = cat.shape[0], cat.shape[1]
side = int(0.5*(min(h,w)))       # side of a square crop, half the shorter image side
i = random.randint(0, h-side)    # random top-left corner of the crop
j = random.randint(0, w-side)
cat_crop = cat[i:i+side, j:j+side]
plt.imshow(cat_crop)
# averaging over the color channels removes the third dimension (grayscale image)
cat_gray = cat.mean(axis=2)
print(cat.shape)
print(cat_gray.shape)
plt.imshow(cat_gray, cmap='gray')
(490, 735, 3)
(490, 735)
Introduction to scikit-learn
Naive Bayes
Problem:
• we have a set of points in 2 dimensions, each assigned a label that is either 0 or 1;
• we want to build a Gaussian Naive Bayes classifier able to predict the label of new points.
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, random_state=2, centers=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');
$$P(L \mid \text{features}) = \frac{P(\text{features} \mid L)\,P(L)}{P(\text{features})}$$
If we use a Gaussian Naive Bayes approach, we assume that the data of each label is drawn from a simple Gaussian distribution, so for each label we need to estimate the mean and the variance of each feature. We also assume that there is no covariance between the features.
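Under these assumptions the likelihood factorizes into a product of univariate Gaussians, one per feature:

$$P(\text{features} \mid L) = \prod_i \mathcal{N}(x_i \mid \mu_{L,i}, \sigma^2_{L,i})$$

which is exactly what the code below computes, using one scipy.stats.norm distribution per (label, feature) pair.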
# boolean masks selecting the points of each class
mask_l0 = y==0
mask_l1 = y==1
# per-feature mean and variance for each class (axis 0 = over the points)
mean_l0 = X[mask_l0].mean(0)
mean_l1 = X[mask_l1].mean(0)
var_l0 = X[mask_l0].var(0)
var_l1 = X[mask_l1].var(0)
print("Mean l0: {}. l1: {}".format(mean_l0, mean_l1))
print("Var l0: {}. l1: {}".format(var_l0, var_l1))
Now when we receive a new point we can compute the probability for the two classes and make a
prediction.
from scipy.stats import norm

# one univariate Gaussian per (label, feature) pair, built from the estimates above
# (norm takes the standard deviation, hence the square root of the variance)
dist_l0_f0 = norm(mean_l0[0], np.sqrt(var_l0[0]))
dist_l0_f1 = norm(mean_l0[1], np.sqrt(var_l0[1]))
dist_l1_f0 = norm(mean_l1[0], np.sqrt(var_l1[0]))
dist_l1_f1 = norm(mean_l1[1], np.sqrt(var_l1[1]))

def prob_l0(point):
    return dist_l0_f0.pdf(point[0]) * dist_l0_f1.pdf(point[1])

def prob_l1(point):
    return dist_l1_f0.pdf(point[0]) * dist_l1_f1.pdf(point[1])

def predict_p(point):
    # the two classes are balanced, so comparing the likelihoods is enough
    result = prob_l0(point) / prob_l1(point)
    if result > 1:
        return 0
    return 1

def predict_set(points):
    return np.array([predict_p(point) for point in points])
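The next cell uses a fitted scikit-learn model and a set of new points Xnew. A minimal sketch of how they can be created (the sampling range for Xnew is an assumption, chosen to roughly cover the plotted region):

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)

# new random points spread over the plot area (assumed range)
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)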
ynew = model.predict(Xnew)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);
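Since GaussianNB makes the same assumptions as our manual implementation (with the priors estimated from the data), the two should agree on nearly all points:

# fraction of new points on which the manual classifier matches sklearn
print((predict_set(Xnew) == ynew).mean())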
Linear regression
Problem: we want to fit a straight-line model of the form
$$y = a_0 + a_1 x$$
to noisy data.
n_points = 100
x = 10 * np.random.rand(n_points)
y = 2 * x - 5 + np.random.randn(n_points)
plt.scatter(x, y);
We can use scikit-learn's LinearRegression estimator to fit our data and construct the best-fit line:
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(x.reshape(-1,1), y)   # sklearn expects a 2D feature matrix, hence the reshape
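We can check that the learned parameters are close to the ones used to generate the data ($a_1 = 2$, $a_0 = -5$):

print("Slope:    ", model.coef_[0])
print("Intercept:", model.intercept_)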
We can plot the estimated line together with the training data:
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit.reshape(-1,1))
plt.scatter(x, y)
plt.plot(xfit, yfit, color='red');
Linear regression can also be used to fit polynomial models of the form
$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots$$
In the name linear regression, the term linear refers to the fact that the model is linear in the coefficients $a_n$: the coefficients never multiply or divide each other.
In order to apply our linear regression model in this situation we have to transform our data. In practice, we consider a multidimensional linear model
$$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots$$
where the new features are built from the single input as $x_n = x^n$.
In our problem we consider data generated using a $\sin$ function with added noise. This is a useful example: there is a global, general regularity (which we wish to learn), but the local observations are corrupted by noise.
x = np.random.rand(n_points)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(n_points)
plt.scatter(x=x, y=y)
plt.show()
To solve the problem we use both the PolynomialFeatures transformer and the LinearRegression model. We can use the make_pipeline function provided by scikit-learn so that we do not need to perform the data transformation and the model fitting in two separate steps:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = make_pipeline(PolynomialFeatures(7),
                           LinearRegression())
poly_model.fit(x.reshape(-1,1), y);
xfit = np.linspace(0, 1, 1000)   # the sin data lives in [0, 1]
yfit = poly_model.predict(xfit.reshape(-1,1))
plt.scatter(x, y)
lim = plt.axis()
plt.plot(xfit, yfit, color='red');
plt.axis(lim);
import sklearn
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
print(X.shape)
(1797, 64)
# each sample is an 8x8 image flattened into a vector of 64 pixels
plt.imshow(X[0].reshape(8, 8), cmap='gray')
plt.show()
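Before normalizing we split the data into a training and a test set; a minimal sketch (the split proportions and random_state are assumptions):

from sklearn.model_selection import train_test_split

# by default, 25% of the samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)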
Normalize the data: many machine learning algorithms work better on standardized data (zero mean and unit variance).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                   # statistics computed on the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
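The model fitted below is again a classifier; in line with the Naive Bayes exercises of this lecture, a Gaussian Naive Bayes model is a natural choice (an assumption; any scikit-learn classifier exposes the same fit/predict API):

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()   # assumed choice of classifier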
model.fit(X_train, y_train);
(for one sample test digit, displayed as an 8x8 image, the model predicts the label [6])
print("Accuracy: {:.2f}".format(model.score(X_test, y_test)))
Accuracy: 0.74