
Assignment: 6

AIM: To perform Data Preprocessing and model building.

PROBLEM STATEMENT / DEFINITION:
Perform the following operations using Python on the Air Quality and Heart Disease data sets:
1. Data cleaning
2. Data integration
3. Data transformation
4. Error correcting
5. Data model building

OBJECTIVE
● To understand the concept of Data Preprocessing.
● To understand the methods of Data Preprocessing.

THEORY:
Data Preprocessing:

Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining, as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.

Preprocessing of data is mainly about checking data quality. The quality can be checked by the following:

● Accuracy: to check whether the data entered is correct or not.
● Completeness: to check whether the data is available and recorded.
● Consistency: to check whether the same data is kept consistently in all the places where it appears.
● Timeliness: the data should be kept up to date.
● Believability: the data should be trustworthy.
● Interpretability: how easily the data can be understood.

Major Tasks in Data Preprocessing:


1. Data cleaning
2. Data integration
3. Data transformation
4. Error correcting
5. Data model building

Let's discuss each of these points in detail:


Before data preprocessing, the necessary libraries and the dataset have to be imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

url = '/content/drive/MyDrive/heart.csv'
df = pd.read_csv(url)

1. Data cleaning:
Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them as necessary. Data cleaning is considered a foundational element of basic data science.

Different Ways of Cleaning Data


Now let's take a closer look at the different ways of cleaning data.

Changing the Datatype of the columns:

The variable types in the heart dataset are:

● Binary: sex, fbs, exang, target
● Categorical: cp, restecg, slope, ca, thal
● Continuous: age, trestbps, chol, thalach, oldpeak

But df.info() gives the following output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -------
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

Hence, we will change the datatypes of these columns to appropriate types.


df['sex'] = df['sex'].astype('object')
df['cp'] = df['cp'].astype('object')
df['fbs'] = df['fbs'].astype('object')
df['restecg'] = df['restecg'].astype('object')
df['exang'] = df['exang'].astype('object')
df['slope'] = df['slope'].astype('object')
df['ca'] = df['ca'].astype('object')
df['thal'] = df['thal'].astype('object')
df.dtypes

Inconsistent columns:

If your DataFrame contains columns that are irrelevant or that you are never going to use, you can drop them to focus on the columns you will work with. Let's see an example of how to deal with such a dataset.

From the Air Quality dataset:

Dropping less valued columns: stn_code, agency, sampling_date and location_monitoring_station do not add much value to the dataset in terms of information. Therefore, we can drop those columns.

Changing the types to a uniform format: When you look at the dataset, you may notice that the 'type' column has values such as 'Industrial Area' and 'Industrial Areas'; both mean the same thing, so let's remove such inconsistencies and make the values uniform.

Creating a year column: To view the trend over a period of time, we need a year value for each row, and most of the values in the date column contain only the year anyway. So, let's create a new column holding year values (see the sketch after the code below).

1. stn_code, agency, sampling_date and location_monitoring_station do not add much value to the dataset in terms of information. Therefore, we can drop those columns.
2. Dropping rows where no date is available.
df = df.drop(['stn_code', 'agency', 'sampling_date', 'location_monitoring_station'], axis=1)  # dropping columns that aren't required
df = df.dropna(subset=['date'])  # dropping rows where no date is available
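The remaining two steps described above (making the 'type' values uniform and creating a year column) are not spelled out in the original code, so here is a minimal sketch. It assumes the air quality dataframe df has string 'type' and 'date' columns and that each date string begins with the year; the replacement mapping is only an example and should be adjusted to the actual labels in the file.

# make the 'type' values uniform (example mapping only)
df['type'] = df['type'].replace({'Industrial Areas': 'Industrial Area'})

# create a 'year' column from the 'date' column
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # unparseable dates become NaT
df['year'] = df['date'].dt.year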

Missing data:

It is rare to have a real-world dataset without any missing values. When you start to work with real-world data, you will find that most datasets contain missing values. Handling missing values is very important because, if they are left as they are, they may affect your analysis and machine learning models. So you need to check whether your dataset contains missing values, and if it does, you must handle them. If you find any missing values in the dataset you can perform any of these three tasks on them:

1. Leave as it is
2. Fill in the missing values
3. Drop them

To fill in the missing values we can use different methods. For example, Figure 4 shows that the air quality dataset has missing values.

The columns so2, no2, rspm, spm and pm2_5 are the ones which contribute most to our analysis. So, we need to remove the nulls from those columns to avoid inaccuracy in the prediction. We use SimpleImputer from sklearn.impute to fill the missing values in every column with the mean.

# defining columns of importance, which shall be used regularly
COLS = ['so2', 'no2', 'rspm', 'spm', 'pm2_5']

from sklearn.impute import SimpleImputer

# invoking SimpleImputer to fill missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[COLS] = imputer.fit_transform(df[COLS])
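As a quick sanity check (a small addition, not in the original write-up), we can confirm that the imputed columns no longer contain missing values:

# count remaining missing values in the imputed columns; all should be 0
print(df[COLS].isnull().sum())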

Outliers:

An outlier is a data point that is significantly different from the other data points in the data set. Outliers can be created due to errors in the experiments or variability in the measurements. Let's look at an example to make the concept clear.

So, now the question is: how can we detect the outliers in the dataset? For detecting outliers we can use:
1. Box plot
2. Scatter plot
3. Z-score, etc.
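The box plot and IQR approaches are used below; as an illustration of the Z-score method mentioned in the list above (not part of the original code), here is a minimal sketch for one continuous column. It assumes the column is roughly normally distributed; the threshold of 3 standard deviations is a common convention, not a requirement.

# Z-score outlier check for the 'chol' column
col = df['chol'].astype(float)
z = (col - col.mean()) / col.std()
print((np.abs(z) > 3).sum())  # number of points more than 3 standard deviations from the mean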
Before we plot the outliers, let's change the labeling for better visualization and interpretation of the heart dataset.

df['target'] = df.target.replace({1: "Disease", 0: "No_disease"})
df['sex'] = df.sex.replace({1: "Male", 0: "Female"})
df['cp'] = df.cp.replace({0: "typical_angina", 1: "atypical_angina", 2: "non-anginal pain", 3: "asymptomatic"})
df['exang'] = df.exang.replace({1: "Yes", 0: "No"})
df['fbs'] = df.fbs.replace({1: "True", 0: "False"})
df['slope'] = df.slope.replace({0: "upsloping", 1: "flat", 2: "downsloping"})
df['thal'] = df.thal.replace({1: "fixed_defect", 2: "reversable_defect", 3: "normal"})

import matplotlib.pyplot as plt
import seaborn as sb

bxplt = sb.boxplot(x="target", y="chol", data=df)
plt.show()
sb.boxplot(x='target', y='oldpeak', data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f95fceb7e10>

Define the continuous variables & plot:

continous_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

def outliers(df_out, drop=False):
    for each_feature in df_out.columns:
        feature_data = df_out[each_feature]
        Q1 = np.percentile(feature_data, 25.)  # 25th percentile of the data of the given feature
        Q3 = np.percentile(feature_data, 75.)  # 75th percentile of the data of the given feature
        IQR = Q3 - Q1  # Interquartile Range
        outlier_step = IQR * 1.5  # that's what we were talking about above
        outliers = feature_data[~((feature_data >= Q1 - outlier_step) &
                                  (feature_data <= Q3 + outlier_step))].index.tolist()
        if not drop:
            print('For the feature {}, No of Outliers is {}'.format(each_feature, len(outliers)))
        if drop:
            df.drop(outliers, inplace=True, errors='ignore')
            print('Outliers from {} feature removed'.format(each_feature))

outliers(df[continous_features])

For the feature age, No of Outliers is 0
For the feature trestbps, No of Outliers is 9
For the feature chol, No of Outliers is 5
For the feature thalach, No of Outliers is 1
For the feature oldpeak, No of Outliers is 5

Drop Outliers:

outliers(df[continous_features], drop=True)

Outliers from age feature removed
Outliers from trestbps feature removed
Outliers from chol feature removed
Outliers from thalach feature removed
Outliers from oldpeak feature removed

Duplicate rows:
Datasets may contain duplicate entries. Deleting duplicate rows is one of the easiest tasks: you can use dataset_name.drop_duplicates().

We have used another approach on the heart dataset. Checking for duplicate rows:

duplicated = df.duplicated().sum()
if duplicated:
    print("Duplicated rows :{}".format(duplicated))
else:
    print("No duplicates")

Duplicated rows :1

Displaying the duplicate rows:

duplicates = df[df.duplicated(keep=False)]
duplicates.head()

Remove the duplicate rows using the above-mentioned method, as sketched below.
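A minimal sketch of that removal (drop_duplicates keeps the first occurrence by default):

# drop the duplicated row found above and verify
df = df.drop_duplicates()
print(df.duplicated().sum())  # should now print 0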

2. Data integration

So far, we've made sure to remove the impurities in data and make it clean.
Now, the next step is to combine data from different sources to get a unified
structure with more meaningful and valuable information. This is mostly
used if the data is segregated into different sources. To make it simple, let's
assume we have data in CSV format in different places, all talking about the
same scenario.
Say we have some data about an employee in a database. We can't expect all
the data about the employee to reside in the same table. It's possible that the
employee's personal data will be located in one table, the employee's project
history will be in a second table, the employee's time-in and time-out details
will be in another table, and so on. So, if we want to do some analysis about
the employee, we need to get all the employee data in one common place.
This process of bringing data together in one place is called data integration.
To do data integration, we can merge multiple pandas DataFrames using the
merge function.

In this exercise, we'll merge the details of students from two datasets, namely student.csv and mark.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed. The mark.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise:
dataset1 = "https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/student.csv"
dataset2 = "https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/mark.csv"
df1 = pd.read_csv(dataset1, header = 0)
df2 = pd.read_csv(dataset2, header = 0)

To print the first five rows of the first DataFrame, add the following code:

df1.head()

The preceding code generates the following output:

Figure : The first five rows of the first DataFrame

To print the first five rows of the second DataFrame, add the following code:

df2.head()

The preceding code generates the following output:

Figure : The first five rows of the second DataFrame

Student_id is common to both datasets. Perform data integration on both the DataFrames with respect to the Student_id column using the pd.merge() function, and then print the first 10 values of the new DataFrame:

df = pd.merge(df1, df2, on = 'Student_id')
df.head(10)

Figure : First 10 rows of the merged DataFrame

Here, the data of the df1 DataFrame is merged with the data of the df2 DataFrame. The merged data is stored inside a new DataFrame called df.

We have now learned how to perform data integration. In the next


section, we'll explore another pre-processing task, data transformation.

3. Data transformation

Previously, we saw how we can combine data from different sources into a
unified dataframe. Now, we have a lot of columns that have different types
of data. Our goal is to transform the data into a machine-learning-digestible
format. All machine learning algorithms are based on mathematics. So, we
need to convert all the columns into numerical format. Before that, let's see
all the different types of data we have.

Taking a broader perspective, data is classified into numerical and categorical data:

Numerical data is further divided into discrete and continuous data.

Categorical data is further divided into nominal and ordinal data.

From these different types of data, we will focus on categorical data.


In the next section, we'll discuss how to handle categorical data.

Handling Categorical Data


There are some algorithms that can work well with categorical data,
such as decision trees. But most machine learning algorithms cannot
operate directly with categorical data. These algorithms require the
input and output both to be in numerical form. If the output to be
predicted is categorical, then after prediction we convert them back to
categorical data from numerical data. Let's discuss some key challenges
that we face while dealing with categorical data.

Encoding

To address the problems associated with categorical data, we can use


encoding. This is the process by which we convert a categorical
variable into a numerical form. Here, we will look at three simple
methods of encoding categorical data.

Replacing

This is a technique in which we replace the categorical data with a


number. This is a simple replacement and does not involve much
logical processing. Let's look at an exercise to get a better idea of
this.

Exercise 6: Simple Replacement of Categorical Data with a Number


In this exercise, we will use the student dataset that we saw earlier. We
will load the data into a pandas dataframe and simply replace all the
categorical data with numbers. Follow these steps to complete this
exercise:

In the Air Quality dataset, we apply simple replacement of categorical data with a number.

df['type'].value_counts()

RRO      179013
I        148069
RO        86791
S         15010
RIRUO      1304
R           158

The column type in the dataframe has 6 unique values, which


we will be replacing with numbers.
df['type'].replace({"RRO":1, "I":2, "RO":3, "S":4, "RIRUO":5, "R":6}, inplace=True)
df['type']

0         1.0
1         2.0
2         1.0
3         1.0
4         2.0
          ...
435734    5.0
435735    5.0
435736    5.0
435737    5.0
435738    5.0
Name: type, Length: 435735, dtype: float64

Converting Categorical Data to Numerical Data Using Label Encoding

This is a technique in which we replace each value in a categorical


column with numbers from 0 to N-1. For example, say we've got a list
of employee names in a column. After performing label encoding, each
employee name will be assigned a numeric label. But this might not be
suitable for all cases because the model might consider numeric values
to be weights assigned to the data. Label encoding is the best method to
use for ordinal data. The scikit-learn library provides LabelEncoder(),
which helps with label encoding.

df['state'].value_counts()

Maharashtra             60382
Uttar Pradesh           42816
Andhra Pradesh          26368
Punjab                  25634
Rajasthan               25589
Kerala                  24728
Himachal Pradesh        22896
West Bengal             22463
Gujarat                 21279
Tamil Nadu              20597
Madhya Pradesh          19920
Assam                   19361
Odisha                  19278
Karnataka               17118
Delhi                    8551
Chandigarh               8520
Chhattisgarh             7831
Goa                      6206
Jharkhand                5968
Mizoram                  5338
Telangana                3978
Meghalaya                3853
Puducherry               3785
Haryana                  3420
Nagaland                 2463
Bihar                    2275
Uttarakhand              1961
Jammu & Kashmir          1289
Daman & Diu               782
Dadra & Nagar Haveli      634
Uttaranchal               285
Arunachal Pradesh          90
Manipur                    76
Sikkim                      1
Name: state, dtype: int64

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
df["state"] = labelencoder.fit_transform(df["state"])
df.head(5)
One Hot Encoding

In label encoding, categorical data is converted to numerical data, and


the values are assigned labels (such as 1, 2, and 3). Predictive models
that use this numerical data for analysis might sometimes mistake these
labels for some kind of order (for example, a model might think that a
label of 3 is "better" than a label of 1, which is incorrect). In order to
avoid this confusion, we can use one-hot encoding. Here, the
label-encoded data is further divided into n number of columns. Here, n
denotes the total number of unique labels generated while performing
label encoding. For example, say that three new labels are generated
through label encoding. Then, while performing one-hot encoding, the
columns will be divided into three parts. So, the value of n is 3.

dfAndhra = df[(df['state']==0)]
dfAndhra
dfAndhra['location'].value_counts()

from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(sparse=False, handle_unknown='error', drop='first')
pd.DataFrame(onehotencoder.fit_transform(dfAndhra[["location"]]))

You have successfully converted categorical data to numerical data


using the OneHotEncoder method.

4. Error Correction

In the heart dataset it can be observed that the feature 'ca' should range from 0–3; however, df['ca'].unique() lists the values 0–4. So let's find the rows with '4' and change them to NaN.

df['ca'].unique()

array([0, 2, 1, 3, 4], dtype=object)

df[df['ca']==4]
df.loc[df['ca']==4,'ca']=np.NaN

Similarly, the feature 'thal' should range from 1–3; however, the unique values also include '0', and there are two rows with that value. They are also changed to NaN using the above-mentioned technique, as sketched below.
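A sketch of that correction, mirroring the 'ca' fix above (the unmapped '0' values are still stored as 0 in the 'thal' column):

df[df['thal']==0]
df.loc[df['thal']==0, 'thal'] = np.nan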
Now, we can replace these newly created NaN values (missing values).

df.isna().sum()
df = df.fillna(df.median())
df.isnull().sum()
5. Data model building

Once you've pre-processed your data into a format that's ready to be


used by your model, you need to split up your data into train and test
sets. This is because your machine learning algorithm will use the data
in the training set to learn what it needs to know. It will then make a
prediction about the data in the test set, using what it has learned. You
can then compare this prediction against the actual target variables in the
test set in order to see how accurate your model is. The exercise in the
next section will give more clarity on this.

We will do the train/test split in proportions. The larger portion of the


data split will be the train set and the smaller portion will be the test set.
This will help to ensure that you are using enough data to accurately
train your model.

In general, we carry out the train-test split with an 80:20 ratio, as per the Pareto principle. The Pareto principle states that "for many events, roughly 80% of the effects come from 20% of the causes." But if you have a large dataset, it really doesn't matter whether it's an 80:20 split or 90:10 or 60:40. (It can be better to use a smaller training set if the process is computationally intensive, but it might cause the problem of overfitting.)

Create a variable called X to store the independent features. Use the drop() function to include all the features, leaving out the dependent or target variable, which in this case is named 'target' for the heart dataset. Then, print out the top five instances of the variable. Add the following code to do this:

X = df.drop('target', axis=1)
X.head()
The preceding code generates the following output:

Figure : Dataframe consisting of independent variables

1. Print the shape of your newly created feature matrix using the X.shape command:

X.shape

The preceding code generates the following output:

Figure : Shape of the X variable


In the preceding figure, the first value indicates the
number of observations in the dataset (284), and the
second value represents the number of features (13).
2. Similarly, we will create a variable called y that will store the target values. We will use indexing to grab the target column. Indexing allows us to access a section of a larger element. In this case, we want to grab the column named target from the df dataframe and print out the top 10 values. Add the following code to implement this:

y = df['target']
y.head(10)
The preceding code generates the following output:

Figure : Top 10 values of the y variable


3. Print the shape of your new variable using the y.shape command:
y.shape
The preceding code generates the following output:
Figure : Shape of the y variable
The shape should be one-dimensional, with a length
equal to the number of observations (284).
4. Make train/test sets with an 80:20 split. To do so, use the train_test_split() function from the sklearn.model_selection package. Add the following code to do this:

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

df = df.apply(preprocessing.LabelEncoder().fit_transform)
# re-create X and y from the encoded dataframe so that the split uses numeric values
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In the preceding code, test_size is a floating-point value that defines the size of the test data. If the value is 0.2, then it is an 80:20 split. train_test_split splits the arrays or matrices into train and test subsets in a random way. Each time we run the code without random_state, we will get a different result.

5. Print the shape of X_train, X_test, y_train, and y_test. Add the following code to do this:

print("X_train : ", X_train.shape)
print("X_test : ", X_test.shape)
print("y_train : ", y_train.shape)
print("y_test : ", y_test.shape)
The preceding code generates the following output:

Figure : Shape of train and test datasets


You have successfully split the data into train and test sets.

In the next section, you will complete an activity wherein you'll


perform pre-processing on a dataset.

Supervised Learning

Supervised learning is a learning system that trains using labeled data


(data in which the target variables are already known). The model
learns how patterns in the feature matrix map to the target variables.
When the trained machine is fed with a new dataset, it can use what it
has learned to predict the target variables. This can also be called
predictive modeling.

Supervised learning is broadly split into two categories. These categories are as
follows:

Classification mainly deals with categorical target variables. A


classification algorithm helps to predict which group or class a data
point belongs to.

When the prediction is between two classes, it is known as binary


classification. An example is predicting whether or not a person has a
heart disease (in this case, the classes are yes and no).

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
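As noted at the start of this section, the predictions can be compared against the actual test labels to see how accurate the model is. A minimal sketch using the metrics module imported above:

# evaluate the decision tree on the held-out test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))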

If the prediction involves more than two target classes, it is known as multi-class classification; for example, predicting all the items that a customer will buy.

Regression deals with numerical target variables. A


regression algorithm predicts the numerical value of the target
variable based on the training dataset.

Linear regression measures the link between one or more predictor


variables and one outcome variable. For example, linear regression
could help to enumerate the relative impacts of age, gender, and diet
(the predictor variables) on height (the outcome variable).
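For illustration only (this is not part of the assignment code), here is a minimal regression sketch. It assumes df refers to the air quality dataframe prepared earlier, with numeric so2, no2 and rspm columns; the choice of predictors and target is hypothetical.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# hypothetical example: predict rspm from so2 and no2
aq = df[['so2', 'no2', 'rspm']].dropna()
Xr_train, Xr_test, yr_train, yr_test = train_test_split(aq[['so2', 'no2']], aq['rspm'], test_size=0.2, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
print(reg.score(Xr_test, yr_test))  # R^2 on the test split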

Time series analysis, as the name suggests, deals with data that is
distributed with respect to time, that is, data that is in a chronological
order. Stock market prediction and customer churn prediction are two
examples of time series data. Depending on the requirement or the
necessities, time series analysis can be either a regression or
classification task.

Unsupervised Learning

Unlike supervised learning, the unsupervised learning process involves


data that is neither classified nor labeled. The algorithm will perform
analysis on the data without guidance. The job of the machine is to
group unclustered information according to similarities in the data. The
aim is for the model to spot patterns in the data in order to give some
insight into what the data is telling us and to make predictions.

An example is taking a whole load of unlabeled customer data and


using it to find patterns to cluster customers into different groups.
Different products could then be marketed to the different groups for
maximum profitability.
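A minimal clustering sketch on made-up, unlabeled customer data (the feature values and cluster count below are purely hypothetical):

from sklearn.cluster import KMeans
import numpy as np

# hypothetical unlabeled customer data: [age, annual_spend]
customers = np.array([[25, 300], [40, 1200], [35, 1100], [22, 250], [50, 1500], [28, 400]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # cluster assignment for each customer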
Unsupervised learning is broadly categorized into two types:

CONCLUSION:

