
🐍

AP PYTHON
Mod 1
Basics:
You would usually use a lot of extra libraries to do all of the preprocessing of the data. So far, just NumPy and Pandas should be
enough. Going further, scikit-learn and PyTorch might be used, but those are not important for this mod.
So, what is data preprocessing? This is the core of AI & ML as everything runs on data. When the data is incomplete,
inconsistent, has outliers, or has not been standardized, you fix it. That's data cleaning, which is one half of preprocessing, the other
half being transforming. After that, the data should be ready to be run through the learning algorithm of an AI/ML/DL model.

What does Pandas do? Basically, it lets you read the file and do basic functions like df.describe() to know the features of the dataset.
Apart from that, almost everything you can do with Pandas can also be done with NumPy. One drawback with NumPy is that it is very
tedious to work with when it comes to strings and datatypes other than int/float; Pandas is easier to work with for text-based
datatypes.
What does NumPy do? After you read the files using Pandas, you do everything else in NumPy.

Data cleaning.
There are a lot of data cleaning methods and each tries to fix a different issue in the dataset. The only data cleaning that we need
here is dealing with missing data.
There are three ways to do it: you can delete the rows that have missing data, you can fill the missing values with the
mean/median/mode, or you can fill the missing data with an indicator - aka something like "Missing" (the last method is the worst of
the three; avoid it).

import numpy as np
import pandas as pd
from google.colab import files
file = files.upload() #upload the titanic.xls file
df = pd.read_excel("titanic.xls")
df.isnull().sum() #tells you the number of null values per feature/column
df.shape #(rows, cols)
new = df.dropna() #drops all rows that contain any null
new2 = df.dropna(subset=['body']) #drops only the rows with a null value in the 'body' col
new.isnull().sum()
new.shape
new2.isnull().sum()
new2.shape

new3 = df.fillna(69) #fills every null value with 69 - Nice!
#use inplace=True to make the change permanent on the original dataframe.

# you can also pass a dict of per-column fill values: values = {"Height": 0, "Weight": 1, "Country": 2, "Place": 3, "Number of days": 4, "stay": 5}
# and then call df.fillna(value=values)

Note : Axis = 0 means rows and Axis = 1 means columns

Normalization
Here, you only need MinMaxScaler, Z-score and simple feature scaling.
But what is normalization? It is explained a bit later.

Simple feature scaling = you take a column's elements and divide them by the max value of that column.

df["colname"]=df["colname"]/df["colname"].max()

MinMaxScaler = often used on the inputs and sometimes on the Y values (the outputs) of an ML/DL dataset; it gives you a normalized dataset in the
range [0,1] (or [-1,1] if you change the feature_range) - the highest value becomes 1 and the lowest becomes 0.

from sklearn.preprocessing import MinMaxScaler


mm1 = MinMaxScaler()
scaled = mm1.fit_transform(X) #X has to be a 2D numpy array (or a DataFrame of numbers)

#or you can do it manually: take every element, subtract the min value of the column and divide by the range of the column (max - min), as sketched below.
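A one-line sketch of that manual version on a Pandas column (assuming df and "colname" are the DataFrame and column you want to scale):

df["colname"] = (df["colname"] - df["colname"].min()) / (df["colname"].max() - df["colname"].min())
#after this, the column's minimum is 0 and its maximum is 1, same as MinMaxScaler's default range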

Z-score normalization = standardization.


It gets rid of the different ranges of different features and gives you real-number values without a fixed range limit like [0,1]. It is
much less affected by outliers than min-max normalization is. (Values at the mean become 0, values greater than the mean become
positive and values less than the mean become negative. NOTE - it is the mean that maps to 0, not the middle element by position.)

from sklearn.preprocessing import StandardScaler


ss1 = StandardScaler()
standardized = ss1.fit_transform(X) #X has to be a 2D numpy array (or a DataFrame of numbers)

#or,
df['colname']=(df['colname'] - df['colname'].mean())/df['colname'].std()

Mod 2
Numpy
You define it like this:

import numpy as np
a = np.array([1,2,3,4], dtype=int)
# or
k = np.arange(0,10,2) #starts from 0 up to 10; 0 is inclusive, 10 is not, and the step value is 2.
j = np.identity(2) #identity matrix of size 2
t = np.full(5,2) #1d: array of size 5 filled with 2. For 2d use np.full((2,2),2) - shape (2,2) filled with 2.

Changing a pandas DataFrame to a numpy array

np_new2 = new2.values
np_new2.shape
np_new2.dtype
np_new2.ndim

#to get random number or zeros or ones.
z=np.zeros((1,2))
o=np.ones((1,2))

k = np.random.randint(1,5,10) #gives 10 integers in the range [1,5) - 5 is excluded


p = np.random.rand(2,3) #(2,3) is the shape; gives uniform random values in [0,1)
j = np.random.randn(1,2) #(1,2) is the shape; gives samples from the standard normal distribution (not natural numbers)

Indexing in Numpy

#for any array.


<numpyarray_name>[<index>]
print(x[1]) #1d: second element. 2d: second row
print(x[1:,:]) #2d: all rows from index 1 onwards, all columns

#operations.
print(x+k) #two arrays of the same shape. or you can use np.add(x,k)

#likewise, there is np.subtract(x,k), np.multiply(x,k), np.divide(x,k), np.reciprocal(x)


#np.power(base, exp) and np.mod(dividend, divisor), which gives the remainder
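A small self-contained sketch tying the indexing and element-wise operations above together (the array names and values are just made up for illustration):

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
k = np.array([[10, 20, 30], [40, 50, 60]])

print(x[1])                #second row -> [4 5 6]
print(x[1:, :])            #all rows from index 1 onwards, all columns
print(x + k)               #element-wise addition, same as np.add(x, k)
print(np.multiply(x, k))   #element-wise product
print(np.power(x, 2))      #element-wise square
print(np.mod(k, 7))        #element-wise remainder of k divided by 7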

Pandas

Basics

#refer to the above code snippets to see how to read a data file into a dataframe.
#now, to create a pandas dataframe on your own:
e = {"sno":[1,2,3,4,5], "Test":[1,2,3,4,5]}
e=pd.DataFrame(e)

Important operations (slicing, group by and joins)

#slicing.


#now there are two slicing functions, loc and iloc, and both do more or less the same
#thing but in different ways: iloc is based on integer positions and loc is based on label names.

#to get a single row.


print(e.loc[row_label]) #if there is no specific row name (cuz there usually won't be),
#just give the index of the row in int form (NOT a string).
#or, print(e.iloc[row_position])

#to get a single col


print(e.loc[:, "colname"]) #or, print(e.iloc[:, column_position])

#to get multiple stuff


e.loc[[1,2],["sno","Test"]] #and iloc is very similar to this as well.

#you can also have step values as well.


print(e.iloc[::2, 1:3]) #before, I was giving lists of exact labels;
#here I am just giving positions/slices. ::2 means all the rows with a step value of 2.

#a great example:
Report_Card.loc[(Report_Card["Name"] == "Benjamin Duran"),
["Lectures","Grades","Credits","Retake"]]

#Concatenation (here v is a second dataframe with columns "sno" and "Test2" - you can see that from the output below)
pd.concat(objs=[e,v])
#output:
sno Test Test2
0 1 1.0 NaN
1 2 2.0 NaN
2 3 3.0 NaN
3 4 4.0 NaN

4 5 5.0 NaN
0 1 NaN 1.0
1 2 NaN 2.0
2 3 NaN 3.0
3 4 NaN 4.0
#notice how the index is restarting.

pd.concat(objs=[e,v],ignore_index=True)
#output:
sno Test Test2
0 1 1.0 NaN
1 2 2.0 NaN
2 3 3.0 NaN
3 4 4.0 NaN
4 5 5.0 NaN
5 1 NaN 1.0
6 2 NaN 2.0
7 3 NaN 3.0
8 4 NaN 4.0
#you used to get similar results with print(e.append(v, ignore_index=True)), but DataFrame.append is deprecated (and removed in pandas 2.0), so stick to pd.concat.

#Joins

p = pd.merge(e,b,on="sno",how="outer")
#p is a new dataframe and e and b are the preexisting dataframes. on is the common col
#in both e and b. how is the type of join.
# all how values: left, right, outer, inner

j=pd.merge(e,v,how="outer",on="Test",sort=True)
#you can also sort them by the on col
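A hedged sketch filling in what b could look like (the merge above only assumes that e and b share an "sno" column; the values here are made up):

b = pd.DataFrame({"sno": [1, 2, 3, 6], "Marks": [55, 60, 70, 90]})  #hypothetical second frame
inner = pd.merge(e, b, on="sno", how="inner")  #keeps only the sno values present in both frames (1, 2, 3)
outer = pd.merge(e, b, on="sno", how="outer")  #keeps every sno from either frame, filling the gaps with NaN
print(inner)
print(outer)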

#GROUP BY

group1=j.groupby(["sno"])

#other stuff that you can do.


len(group1)
group1["Test"].nunique()
group1.groups #gives you all the groups.
group1.get_group(group_name) #gives that group.
group1["sno"].value_counts()
#you can also do stuff like sum, mean, max, min.
#using these you can answer more complex queries, e.g. (note: the boolean filter goes on the dataframe j, not on the groupby object):
j[j["Test"]==2] #gives all rows where Test = 2
j[j["Test"]==2]["sno"].sum() #now you are taking the sum of sno where Test = 2

for group_name, group_df in group1:
    print(group_name)
    print()
    print(group_df)
    print()
#to go through the groups one by one.

#there is also something called the agg function


#example: g1.agg({"Courses": "count", "Fee": "min", "Duration": "min", "Discount": "min"})
#valid aggregations include count, sum, min, max, mean, median, prod, std and var
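A small self-contained groupby + agg sketch (the DataFrame here is made up purely for illustration, it is not the notes' dataset):

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "item":  ["pen", "book", "pen", "pen", "book"],
    "price": [10, 50, 12, 11, 55],
})
g = sales.groupby("store")
print(len(g))             #number of groups -> 2
print(g["price"].sum())   #total price per store
print(g.agg({"item": "count", "price": ["min", "max", "mean"]}))  #different aggregations per column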

Pandas Data Structures.

#series.
s = pd.Series() #that's empty

k=np.array([1,2,3,4])
kdf = pd.Series(k, index=[1,2,3,4]) #this is just a way to turn numpy stuff into pandas.
#index is not required but it is nice to have.
#you can also use a dictionary for this, where the keys become the index.

#scalar creation: you create a pandas series from just one value.
k = 69
kdf = pd.Series(k, index=[1,2,3,4,5]) #index is required here as it sets the size (the value is repeated for every index).

#you can get the values like normal slicing.


kdf[1:]

#or you can use conditions too.


kdf[kdf<3]

#between works too.


data_c['Salary'].between(12000, 20000, inclusive="both") #that is a dataframe column, not the series above; newer pandas wants "both"/"left"/"right"/"neither" for inclusive instead of True/False.

#to delete a col from a dataframe


del op["Test"]

#and you can use this too.


data_pt.describe()

#and you can turn the dataframe back to csv


dataframename.to_csv('your_file_name.csv')

CA3
mod 3 and 4

Mod 4
Linear regression - XW + B (a perceptron). X is the independent variable, W is the slope (weight) and B is the bias (intercept).
There could also be an error term, which ends up in an equation like XW + B ± C (that ± means plus or minus the error).
Residual sum of squares: SS_res = Σ(y_true - y_pred)²
Total sum of squares: SS_tot = Σ(y²) - (Σy)²/n
Degrees of freedom - learn after this.
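A quick numpy sketch of those two sums of squares and the R² score built from them, on made-up y values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)                              #residual sum of squares
ss_tot = np.sum(y_true ** 2) - (np.sum(y_true) ** 2) / len(y_true)   #total sum of squares, same as sum((y - mean)^2)
r2 = 1 - ss_res / ss_tot                                             #R² of the fit
print(ss_res, ss_tot, r2)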

Logistic regression = linear reg works for real number values, but what if you want to classify into categorical data?
(basically the sigmoid, which is what logistic regression is built on)

The important terms:


The odds of an event = p/(1-p) (this gives a lot of awkward values, and you don't want weird values, right? So you take the log
of it, making it log(p/(1-p)), which is the log-odds (logit) of the event, not a probability.)

z = XW + B
and XW + B = log(p/(1-p))

and sigmoid = 1/(1+e^(-z)) = that converts z back into the probability p of the event.
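A tiny sketch of z and the sigmoid in numpy, with one made-up sample and made-up weights (just to show the shape of the formulas):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   #1/(1+e^(-z))

X = np.array([0.5, 2.0])    #one sample with two features (example values)
W = np.array([1.2, -0.7])   #example weights
B = 0.3                     #example bias

z = X @ W + B               #z = XW + B, the log-odds log(p/(1-p))
p = sigmoid(z)              #probability of the event, back in (0, 1)
print(z, p)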

classification vs regression = you know this shit alr.


important models of classification: logistic regression, K-nearest neighbours, support vector machines, decision tree classifiers (can be split
into binary and multi-class)
important models of regression: linear reg, polynomial reg, support vector regression, decision tree regression (can be split into linear regression and
non-linear regression)

Support vector machine


It works for both linearly and non-linearly separable data, but only linearly separable data is in the portions, so we'll study only that.
So you have a bunch of points on the graph which are linearly separable and you want to draw a separating line with the biggest
possible distance (margin) to the nearest points on each side. That line is the maximum-margin hyperplane (in 2D it is just a line), and the
nearest points that fix its position are the support vectors. Other lines could also separate the classes, but with a smaller margin than the max-margin one.

So how do you find the SVM if you are given a set of values?
You plot the points and take the closest points to the separating line - those are the support vectors.
Let's assume the points are (1,2) and (3,4).
Now you have sv1 = (1,2) and sv2 = (3,4), and you append a bias term of 1 to each: sv1 = (1,2,1) and sv2 = (3,4,1).

The weight vector is a combination d1 * sv1 + d2 * sv2.


Now you dot that combination with sv1 and sv2, and each result must equal the polarity of that support vector (the equation with sv1 equals 1
because sv1 is the positive/yes class, and the equation with sv2 equals -1 because sv2 is the negative/no class):

1. d1 (1,2,1)(1,2,1) + d2(3,4,1)(1,2,1) = 1

2. d1 (1,2,1)(3,4,1) + d2(3,4,1)(3,4,1) = -1

so it equals to

1. d1(1,4,1) + d2(3,8,1) = 1

2. d1(3,8,1) + d2(9,16,1) =-1

now do dot product.


d1(1+4+1) + d2(3+8+1) = 1 (like that.)
so, 1. 6d1 +12d2 = 1

2. 12d1 +26d2 = -1

now multiply equation 1 by 2 and solve normally.

12d1 + 24d2 = 2
(-) 12d1 (-) 26d2 = (+) 1
so that gives you -2d2 = 3 ⇒ d2 = -3/2 = -1.5 (3/2 on the negative side.)
substituting back: 12d1 + 24(-3/2) = 2
12d1 = 38 ⇒ d1 = 38/12 = 19/6 ≈ 3.17 (for the positive side.)
now get the weights: w = Σ di * SVi

so;
3.17(1,2,1) + (-1.5)(3,4,1)
= (3.17, 6.33, 3.17) + (-4.5, -6, -1.5)
= (-1.33, 0.33, 1.67), so w1 = -1.33, w2 = 0.33 and the actual bias is the last entry, b = 1.67.

so the prediction formula is w1*x1 + w2*x2 + b (it gives +1 on sv1 and -1 on sv2, which checks out).
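A numpy sketch that re-does the worked example above and checks the numbers (np.linalg.solve handles the two simultaneous equations):

import numpy as np

sv1 = np.array([1, 2, 1])   #positive support vector with the bias term appended
sv2 = np.array([3, 4, 1])   #negative support vector with the bias term appended

A = np.array([[sv1 @ sv1, sv2 @ sv1],
              [sv1 @ sv2, sv2 @ sv2]])   #[[6, 12], [12, 26]]
b = np.array([1, -1])                    #polarities of the two support vectors
d1, d2 = np.linalg.solve(A, b)           #d1 ≈ 3.17, d2 = -1.5

w = d1 * sv1 + d2 * sv2                  #(w1, w2, bias) ≈ (-1.33, 0.33, 1.67)
print(d1, d2, w)
print(w[:2] @ np.array([1, 2]) + w[2])   #prediction for sv1 -> +1
print(w[:2] @ np.array([3, 4]) + w[2])   #prediction for sv2 -> -1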

Accuracy
put down the confusion matrix as usual and follow these formulas.
precision = tp/(tp+fp) (note: tp = true positive and fp = false positive)
recall = tp/(tp+fn)

accuracy = (tp+tn)/n
error rate = (fp+fn)/n
sensitivity = recall.

specificity = recall but with the negative class: tn/(tn+fp)
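A small sketch computing all of those metrics from raw confusion-matrix counts (the counts are made up):

tp, fp, fn, tn = 40, 10, 5, 45
n = tp + fp + fn + tn

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)     #same thing as sensitivity
accuracy    = (tp + tn) / n
error_rate  = (fp + fn) / n
specificity = tn / (tn + fp)
print(precision, recall, accuracy, error_rate, specificity)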

KNN and KMean


K-means: you take a cluster and check all of the points in that cluster, and if they all (or almost all of them) belong to one class,
then future points which land in that cluster are assigned to the same class.

You take a point, subtract it from each cluster's centroid, and whichever centroid gives the least distance is the correct cluster (take the
absolute value if the difference is negative).
Then you take the mean of all of the points in each cluster to form that cluster's new centroid.

Now redo the whole thing with the new centroids.
Redo until the centroids stop moving (or you are satisfied). A one-iteration sketch is shown below.
KNN - you find the nearest pre-existing points to your test element and assign the test element to the class that has the majority among
those k nearest points (for k = 1, that is simply the class of the single closest point). That's all.
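A numpy sketch of one k-means assignment + update step on made-up 2D points (two clusters, hand-picked starting centroids):

import numpy as np

points    = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   #initial guesses

#assignment step: each point goes to the centroid it is closest to
dists  = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)                  #-> [0, 0, 1, 1]

#update step: each centroid becomes the mean of the points assigned to it
new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])
print(labels)
print(new_centroids)
#repeat both steps with the new centroids until they stop moving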

Mod 3 (Data visualization with Matplotlib)


Always import matplotlib.pyplot as plt,
and import numpy as np (just to be sure).
to draw lines.

xaxis = np.array([0,10])
yaxis = np.array([1,100])
plt.plot(xaxis, yaxis)
plt.show()
#draws a line from (0,1) to (10,100)

plt.plot(x, y)

# Set limits and labels
plt.xlim([0, 10])
plt.ylim([-1, 1])
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')

to draw it without lines.

plt.plot(xaxis, yaxis, '*') #marks the points with * (the marker string needs quotes)


#if xaxis is not given, then it will take the index values of the yaxis array as the xaxis
#plt.scatter()
#plt.bar()
#plt.hist()
#plt.boxplot()
#or, with pandas' df.plot(), just pass kind='line' (or 'bar', 'hist', 'box') and title= gives it a title

All markers:
'o' - circle
'*' - star
'-' - solid line
'--' - dashed line (':' is dotted)
'.' - point (dot)
',' - pixel (comma)
'+' - plus
's' - square
'D' - diamond
'd' - thin diamond ('p' is pentagon)
'h'/'H' - hexagon

'v' - triangle down ('^' is triangle up)


colour:
let's say you want the line to be in red; you add the colour letter to the format string, e.g. '-r' (solid red line) or ':r' (dotted red line)
w, k - white, black
r, g, b - red, green, blue
c, m, y - cyan, magenta, yellow
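For example, combining a marker, a line style and a colour in one format string (xaxis/yaxis are the arrays from earlier):

plt.plot(xaxis, yaxis, 'r--*')   #red dashed line with star markers
plt.plot(xaxis, yaxis, 'bo')     #blue circles only, no connecting line
plt.show()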

With Seaborn.
now you import seaborn as sns.

sns.distplot([1,2,3,4], hist=False) #distplot is deprecated in newer seaborn; sns.kdeplot / sns.displot do the same job
plt.show()
#shows you the distribution of any one numerical col
#or you can go like this:
sns.lineplot(x='col1', y='col2', data=data_variable)
plt.show()
#you can also use scatterplot(), histplot(), barplot(); the hue parameter gives a different colour for different values.

kde plot = kernel density estimate; it is the smoothed curve version of a histogram, showing the shape of the distribution.

sns.jointplot(x='col1',y='col2', data = data_variable)

plt.show()

sns.pairplot(data_variable)
and that plots all the possible pairwise combinations of plots for all of the numerical columns.
stripplot(x,y) is for one categorical and one numerical variable.

swarmplot(x,y) is basically the same as stripplot but with less overlapping.


violinplot(x,y) is like a box-and-whisker plot combined with a KDE (similar idea).
countplot(x) takes only an x value.

#for plt subplots.


fig, ax = plt.subplots(2, 2)
ax[0][1].plot(xaxis, yaxis) #plot whatever you want on that subplot
#the 0 and 1 denote the indexes (row 0, column 1) of the subplot you are drawing on.

Mod 5 (TensorFlow and Basics of ML)
TensorFlow is a library which is mainly used to handle high-complexity data and perform basic ML tasks (mostly for coding up the maths
functions which we don't need to understand at the moment). Many high-level ML/DL (and even AI) libraries, like Keras, are built
on top of TensorFlow (PyTorch is a separate framework that does a similar job). TensorFlow is the library that actually uses the CPU and GPU. The syntax sort of resembles C++.

Importing the library and a basic code:

import tensorflow as tf
tf.compat.v1.disable_eager_execution() #on TF 2.x the old Session API needs eager execution disabled first
with tf.compat.v1.Session() as s: #to run code
    a = tf.constant(60)
    b = tf.constant(9)
    nice = tf.add(a, b)
    print(nice) #the output is Tensor("Add_1:0", shape=(), dtype=int32), but you don't want it like that, right?
    print(s.run(nice)) #so the run function gives the output as 69, as a numpy scalar (not a numpy array, just one value)

Variables

with tf.compat.v1.Session() as s:
    test_var = tf.Variable(tf.zeros([3,2])) #here you are creating a variable, which can be changed, unlike constants
    var_inst = tf.compat.v1.global_variables_initializer() #and you are initializing the said variables
    s.run(var_inst)
    print(s.run(test_var))
    test_var = tf.zeros([5,2]) #rebinding the python name to a new tensor
    print(s.run(test_var))

#another example code:


tv1 = tf.Variable([1,2,3,4,5])
print(tv1)
tv1[1].assign(103) #sets index 1 to 103
print(tv1)
tv1.assign_sub([1,2,3,4,5]) #subtracts the values given in the list from the actual tv1 list, element-wise.
print(tv1)

And I just got to know that TensorFlow is not in the portions for this current finals, so don't study that. Moving on to the sklearn stuff.

For linear and log regression:

Import numpy, then use pandas to fetch the dataset and preprocess it, and after you are done, split it into
training and testing sets.

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Using the sklearn.preprocessing module, import StandardScaler or MinMaxScaler and then normalize the data if required.

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)
#to predict
results = regressor.predict(X_test)
pred = scaler.inverse_transform(results) #only needed if you scaled y earlier with a scaler object
#now you can do the accuracy check.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score, accuracy_score
print(mean_squared_error(y_test, pred)) #or any of the other regression metrics

for Log reg.

#the following code from above:


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
#to
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression()
#and everything else is the same, all you need to do is change the functions

from sklearn.metrics import classification_report, confusion_matrix


cm = confusion_matrix(y_train, regressor.predict(X_train))
print(cm)
# ^ for confusion matrix

#for svm, keep everything literally the same as log reg but change the function.
from sklearn.svm import SVC
model=SVC()
model.fit(x_train, y_train)

#for KNN, there is one important parameter called n_neighbors, which is the number of neighbours the model is going to use (default is 5).
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3)

classifier.fit(x_train, y_train)

#For KMeans, you have n_clusters, which plays the same role as n_neighbors but for the number of clusters (default = 8).
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=i) #i = the number of clusters you want to try
kmeans.fit(X_train) #KMeans is unsupervised, so you only fit X, not y
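One common way to pick n_clusters is the "elbow" loop below; this is a sketch assuming X_train from the earlier train_test_split is the data you want to cluster:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X_train)                 #unsupervised, so no y
    inertias.append(km.inertia_)    #within-cluster sum of squared distances

plt.plot(range(1, 11), inertias, 'o-')
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.show()
#pick the n_clusters value where the curve stops dropping sharply (the "elbow")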

Learn how to plot stuff along with finding the values for these models and you are set.

