AP PYTHON
Mod 1
Basics:
You would usually use a lot of extra libraries to do all of the preprocessing of the data. For now, just NumPy and Pandas should be
enough. Going further, scikit-learn and PyTorch might be used, but those are not important for this mod.
So, what is data preprocessing? It is the core of AI & ML, since everything runs on data. When the data is incomplete, inconsistent,
has outliers, or has not been standardized, you fix it. That's data cleaning, which is one half of preprocessing, the other
half being transforming. After that, the data should be ready to be run through a learning algorithm of an AI/ML/DL model.
What does Pandas do? Basically, it lets you read the file and do basic functions like df.describe() to know the features of the dataset.
Apart from that, almost everything you can do with Pandas can also be done with NumPy. One drawback with NumPy is that it is very
tedious to work with when it comes to strings and datatypes other than int/float. Pandas is easier to work with for text-based
datatypes.
What does NumPy do? After you read the files using Pandas, you do everything else with NumPy.
Data cleaning.
There are a lot of data cleaning methods and each one tries to fix a different issue in the dataset. The only data cleaning that we want
here is dealing with missing data.
There are three ways to do it: you can delete the rows that have missing data, you can fill the missing values with the
mean/median/mode, or you can just fill the missing data with an indicator - aka something like "Missing" (the last method is total trash. don't
do that.)
import numpy as np
import pandas as pd
from google.colab import files
file = files.upload() #calling the titanic.xls file
df = pd.read_excel("titanic.xls")
df.isnull().sum() #tells you the no. of null values per feature.
df.shape #(rows,col)
new = df.dropna() #drops all rows with null.
new2 = df.dropna(subset=['body']) #drops all rows with null value in body col
new.isnull().sum()
new.shape
new2.isnull().sum()
new2.shape
new3=df.fillna(69) #fills all rows with null value with the value of 69 - Nice!
#use inplace=True to make the change permanent to the og dataframe.
# you can also build a dict like values = {"Height": 0, "Weight": 1, "Country": 2, "Place": 3, "Number of days": 4, "stay": 5}
# and make the fillna call like this: df.fillna(value=values)
Normalization
Here, you would only need MinMaxScaler, Z-score and simple feature scaling.
But what is normalization? That will be explained a bit later.
Simple feature scaling = you take a column's elements and then divide them by the max value of that column.
df["colname"]=df["colname"]/df["colname"].max()
MinMaxScaler = you usually do this for Y values (aka outputs of every ML/AI/DL dataset) which gives you a normalized dataset of
range [0,1] or [-1,1] (the highest value will be 1 and the lowest will be 0)
#min-max: take every element, subtract the min value of the column and then divide by the range of the column (max - min)
df['colname']=(df['colname'] - df['colname'].min())/(df['colname'].max() - df['colname'].min())
#Z-score: subtract the mean and divide by the standard deviation
df['colname']=(df['colname'] - df['colname'].mean())/df['colname'].std()
Mod 2
Numpy
You define it like this:
import numpy as np
a=np.array([1,2,3,4] , dtype = int)
# or
k=np.arange(0,10,2) #starts from 0 to 10, 0 is inclusive and 10 is not and the step value is 2.
j=np.identity(2) #identity matrix of size 2
t=np.full(5,2) #for 1d. for 2d: np.full((2,2),2) - fills the value 2 into an array of size 5 in the 1d case and of shape (2,2) in the 2d case
np_new2 = new2.values #.values turns the dataframe from earlier into a numpy array
np_new2.shape
np_new2.dtype
np_new2.ndim
#to get random number or zeros or ones.
z=np.zeros((1,2))
o=np.ones((1,2))
Indexing in Numpy
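A quick sketch of the basic indexing/slicing syntax (the array b here is just a made-up example):
b=np.array([[1,2,3],[4,5,6]])
b[0,1] #element at row 0, col 1 -> 2
b[0] #the whole first row -> [1 2 3]
b[:,2] #the whole last col -> [3 6]
b[0,0:2] #slicing: first two elements of row 0 -> [1 2]
b[b>3] #boolean indexing -> [4 5 6]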
#operations.
x=np.array([5,6,7,8,9]) #another array, the same shape as k from above
print(x+k) #element-wise add of two diff arrays. or you can use np.add(x,k)
Pandas
Basics
#refer the above code snippets to learn how to insert a dataframe file.
#now to understand how to create a pandas dataframe on your own.
e={"sno":[1,2,3,4,5],"Test":[1,2,3,4,5]}
e=pd.DataFrame(e)
#a great example: selecting a few columns for the rows where the Name matches:
Report_Card.loc[(Report_Card["Name"] == "Benjamin Duran"),
["Lectures","Grades","Credits","Retake"]]
#Concatenation. v is a second dataframe; judging from the output below, something like v=pd.DataFrame({"sno":[1,2,3,4],"Test2":[1,2,3,4]})
pd.concat(objs=[e,v])
#output:
sno Test Test2
0 1 1.0 NaN
1 2 2.0 NaN
2 3 3.0 NaN
3 4 4.0 NaN
4 5 5.0 NaN
0 1 NaN 1.0
1 2 NaN 2.0
2 3 NaN 3.0
3 4 NaN 4.0
#notice how the index is restarting.
pd.concat(objs=[e,v],ignore_index=True)
#output:
sno Test Test2
0 1 1.0 NaN
1 2 2.0 NaN
2 3 3.0 NaN
3 4 4.0 NaN
4 5 5.0 NaN
5 1 NaN 1.0
6 2 NaN 2.0
7 3 NaN 3.0
8 4 NaN 4.0
#You can get similar results using print(e.append(v, ignore_index=True)) (note: DataFrame.append is deprecated/removed in newer pandas, so prefer pd.concat)
#Joins
p = pd.merge(e,b,on="sno",how="outer")
#p is a new dataframe and e and b are the preexisting dataframes. on is the common col
#in both e and b. how is the type of join.
# all how values: left, right, outer, inner
j=pd.merge(e,v,how="outer",on="Test",sort=True)
#you can also sort them by the on col
#GROUP BY
group1=j.groupby(["sno"])
#to go thru the groups:
for name, group in group1:
    print(name)
    print(group)
#series.
s = pd.Series() #that's empty
k=np.array([1,2,3,4])
kdf=pd.Series(k,index=[1,2,3,4]) #this is just a way to turn numpy stuff to pandas.
#index is not required but it is nice to have.
#you can also use dictionary for this where the key will be the index.
#scalar creation: you create a pandas series from just one value (it gets repeated for every index entry).
k=69
kdf=pd.Series(k,index=[1,2,3,4,5]) #index is required here as it sets the size.
CA3
Mod 3 and 4
Mod 4
Linear regression - XW + B (perceptron) (X - the independent variable, W - the slope (weights), B - the bias)
there could be another error term, which would end up in an equation like this: XW + B ± C (that ± means plus or minus the error)
residual sum of squares = sum of (Y_true - Y_pred)^2
and total sum of squares = sum of (y^2) - ((sum of y)^2 / n)
degree of freedom - learn after this.
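A tiny numpy sketch of those formulas (the x/y arrays and the slope/bias are made-up numbers, just to show the arithmetic):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0]) #made-up independent variable
y_true = np.array([2.0, 4.1, 5.9, 8.2]) #made-up observed outputs
w, b = 2.0, 0.0 #assume some fitted slope and bias
y_pred = x * w + b #the XW + B line
sse = np.sum((y_true - y_pred) ** 2) #residual sum of squares
sst = np.sum(y_true ** 2) - (np.sum(y_true) ** 2) / len(y_true) #total sum of squares
print(sse, sst)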
Logistic regression = so linear reg works for real number values, but what if you want to classify data into categories?
(basically the sigmoid, which is what logistic regression is built on)
z = XW + B
and XW + B = log(p/(1-p))
and sigmoid = 1/(1+e^(-z)), which also gives you the probability of an event.
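A small numpy sketch of the sigmoid (the inputs, weights and bias are made up, just to show that z gets squashed into a probability between 0 and 1):
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z)) #1/(1+e^(-z))
X = np.array([1.5, -0.3]) #made-up input features
W = np.array([0.8, 2.0]) #made-up weights
B = 0.5 #made-up bias
z = X @ W + B #z = XW + B
print(sigmoid(z)) #probability of the event, always between 0 and 1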
so how do you find the SVM if you are given a set of values?
so you plot the points and then take the closest points to the hyperplane (those are the support vectors).
so lets assume the pts are (1,2) for the positive class and (3,4) for the negative class
now you have sv1 (1,2) and sv2 (3,4) and now you are assuming a bias of 1, so you append that: sv1 (1,2,1) and sv2 (3,4,1)
1. d1 (1,2,1)·(1,2,1) + d2 (3,4,1)·(1,2,1) = 1
2. d1 (1,2,1)·(3,4,1) + d2 (3,4,1)·(3,4,1) = -1
so it equals to
1. 6d1 + 12d2 = 1
2. 12d1 + 26d2 = -1
multiply eq 1 by 2 and subtract eq 2:
12d1 + 24d2 = 2
(-)12d1 (-) 26d2 = 1
so that gives you -2d2 = 3 ⇒ d2 = -3/2 = -1.5 (3/2 on the negative side.)
12d1 + 24(-3/2) = 2
d1 = 38/12 = 19/6 ≈ 3.17 (for the positive side.)
now get the weights: w = sum of Di * SVi
so;
3.17(1,2,1) + -1.5(3,4,1)
= (3.17, 6.33, 3.17) + (-4.5, -6, -1.5)
= (-1.33, 0.33, 1.67), so the actual bias is 1.67 and w1 = -1.33 and w2 = 0.33
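You can sanity-check that hand calculation by solving the same 2x2 system with numpy (the points are the ones from the example above):
import numpy as np
sv1 = np.array([1, 2, 1]) #support vector (1,2) with the bias 1 appended
sv2 = np.array([3, 4, 1]) #support vector (3,4) with the bias 1 appended
#the two equations: d1*(sv1.sv1) + d2*(sv2.sv1) = 1 and d1*(sv1.sv2) + d2*(sv2.sv2) = -1
A = np.array([[sv1 @ sv1, sv2 @ sv1],
              [sv1 @ sv2, sv2 @ sv2]])
d = np.linalg.solve(A, np.array([1, -1]))
print(d) #[ 3.17 -1.5 ] -> d1 and d2
w = d[0] * sv1 + d[1] * sv2 #weights = sum of Di * SVi
print(w) #[-1.33  0.33  1.67] -> w1, w2 and the bias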
Accuracy
put the confusion matrix as usual and follow these formulas.
precision = tp/(tp+fp) (note: tp = true positive and fp = false positive)
recall = tp/(tp+fn)
accuracy = (tp+tn)/n
error rate = (fp+fn)/n
sensitivity = recall.
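A quick sketch of those formulas with made-up confusion-matrix counts (tp/fp/fn/tn are arbitrary numbers here):
tp, fp, fn, tn = 40, 10, 5, 45 #made-up counts from a confusion matrix
n = tp + fp + fn + tn #total number of samples
precision = tp / (tp + fp) #0.8
recall = tp / (tp + fn) #0.888... (same as sensitivity)
accuracy = (tp + tn) / n #0.85
error_rate = (fp + fn) / n #0.15
print(precision, recall, accuracy, error_rate)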
K-Means: you take each point and then subtract it from each cluster's centre point; whichever gives the least distance is the correct cluster for it (take the mod/absolute value if it's negative).
and then you take the mid point (mean) of all of the points in that cluster to form the new centre point of the cluster.
now redo the whole thingy with the new cluster points.
redo till you are satisfied (i.e. till the cluster centres stop moving)
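A minimal numpy sketch of one K-Means round (the points and the starting cluster centres are made up):
import numpy as np
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]) #made-up 2d points
centres = np.array([[1.0, 1.0], [9.0, 9.0]]) #made-up starting centres
#assign each point to the centre with the least distance
dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)
print(labels) #[0 0 1 1]
#update each centre to the mid point (mean) of its assigned points
new_centres = np.array([points[labels == c].mean(axis=0) for c in range(len(centres))])
print(new_centres) #then redo the assignment with these new centres till they stop moving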
KNN - you compute the distance from your test element to all of the pre-existing elements, take the k nearest ones, and assign
your test element to the class most of those k belong to (with k=1 it's just the class of the pre-existing element with the least distance). that's all
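A bare-bones numpy sketch of that idea (made-up training points/labels, with k=3 neighbours):
import numpy as np
train_x = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.5, 9.0], [7.5, 8.2]]) #made-up points
train_y = np.array([0, 0, 1, 1, 1]) #their classes
test = np.array([7.0, 7.5]) #the element to classify
k = 3
dists = np.linalg.norm(train_x - test, axis=1) #distance to every pre-existing element
nearest = np.argsort(dists)[:k] #indices of the k closest ones
votes = train_y[nearest]
print(np.bincount(votes).argmax()) #majority class among the k -> 1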
import matplotlib.pyplot as plt
xaxis = np.array([0,10])
yaxis = np.array([1,100])
plt.plot(xaxis,yaxis)
plt.show()
#draws a line from (0,1) to (10,100)
plt.plot(x, y) #x and y being whatever arrays you want to plot
# Set limits and labels
plt.xlim([0, 10])
plt.ylim([-1, 1])
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
All markers:
'o' - circle
'*' - star
'.' - point (dot)
',' - pixel (comma)
'+' - plus
's' - square
'D' - diamond
'd' - thin diamond ('p' is the pentagon)
'H' - hexagon
(note: '-' and '--' are line styles, not markers: solid and dashed lines. usage example below.)
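A quick usage example with the markers (reusing the xaxis/yaxis arrays from the plotting snippet above):
plt.plot(xaxis, yaxis, 'o') #same data, drawn with circle markers
plt.plot(xaxis, yaxis, '--') #and again as a dashed line
plt.show()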
With Seaborn.
now you import seaborn as sns.
sns.distplot([1,2,3,4], hist=False)
plt.show()
#shows you the distribution of any one numerical col (note: distplot is deprecated in newer seaborn; kdeplot/displot do the same job)
#or you can go like this:
sns.lineplot(x='col1',y='col2', data = data_variable)
plt.show()
# you can even use scatterplot(), histplot(), barplot(), hue parameter = diff colour for diff values.
sns.pairplot(data_variable)
plt.show()
and that plots all the possible pairwise combinations of plots for all of the (numeric) columns
stripplot(x,y) is for one categorical and one numerical variable.
Mod 5 (Tensorflow and Basics of ML)
Tensorflow is a library which is mainly used to handle high complexity data and perform basic ML tasks (mostly for coding up math
functions which we don't need to understand at the moment). High level DL libraries like Keras are built on top of Tensorflow
(PyTorch, on the other hand, is a separate framework, not built on Tensorflow). It can run on both CPU and GPU. The syntax sort of resembles C++.
import tensorflow as tf
tf.compat.v1.disable_eager_execution() #needed on TF2 so the session-style (TF1) code below behaves as described
with tf.compat.v1.Session() as s: #to run code
    a=tf.constant(60)
    b=tf.constant(9)
    nice = tf.add(a,b)
    print(nice) #the output is Tensor("Add_1:0", shape=(), dtype=int32), but you don't want it like that, right?
    print(s.run(nice)) #so the run function will give the output as 69 of type numpy (not numpy array, just one numpy value)
Variables
with tf.compat.v1.Session() as s:
    test_var = tf.Variable(tf.zeros([3,2])) #here, you are creating a variable, which can be changed unlike constants
    var_inst = tf.compat.v1.global_variables_initializer() #and you are initializing the said variables
    s.run(var_inst)
    print(s.run(test_var))
    test_var = (tf.zeros([5,2])) #rebinding the name to a new (5,2) tensor of zeros
    print(s.run(test_var))
And I just got to know that TensorFlow is not in the portions for this current finals, so don't study that. Moving on to the sklearn stuff.
Import numpy and pandas to fetch the dataset, then preprocess it, and after you are done, split it into training and testing sets.
Using the sklearn.preprocessing module, import StandardScaler or MinMaxScaler and then normalize the data if required.
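A rough sketch of that whole flow, assuming some generic dataframe df with a 'target' column (the column names, the test_size and the use of logistic regression are placeholders to match the "same as log reg" comment below):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

x = df.drop(columns=["target"]) #hypothetical feature columns
y = df["target"] #hypothetical label column
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

scaler = StandardScaler() #or MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test) #reuse the scaler fitted on the training data

model = LogisticRegression()
model.fit(x_train, y_train)
print(model.score(x_test, y_test)) #accuracy on the test split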
#for svm, keep everything literally the same as log reg but change the function.
from sklearn.svm import SVC
model=SVC()
model.fit(x_train, y_train)
#for KNN, there is one important parameter called n_neighbors which is the number of neighbours that the model is going to use, which has a default value of 5
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
#For KMeans, you have n_clusters which is basically the same idea as n_neighbors but for the number of clusters. (default=8)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = i) #i here would come from a loop (see the elbow-method sketch below) or just a number you pick
kmeans.fit(X_Train) #KMeans is unsupervised, so you only fit on X (no Y needed)
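Here is a sketch of that elbow-method loop for picking i (kmeans.inertia_ is the within-cluster sum of squared distances; the range 1 to 10 is arbitrary):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for i in range(1, 11): #try 1 to 10 clusters
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X_Train)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, 'o-') #look for the "elbow" in this curve
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()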
Learn how to plot stuff along with finding the values for these models and you are set.