Part A Assignment 6
OBJECTIVE
● To understand the concept of Data Preprocessing.
● To understand the methods of Data Preprocessing.
THEORY:
Data Preprocessing:
1. Data cleaning:
Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data and then modifying, replacing, or deleting them as necessary. Data cleaning is considered a foundational element of data science.
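Throughout this part we work on a heart-disease dataset; a minimal sketch of loading it (the file name heart.csv is an assumption, not taken from the text):

import pandas as pd
import numpy as np

df = pd.read_csv("heart.csv")   # hypothetical path to the heart-disease CSV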
● Continuous: age, trestbps, chol, thalach, oldpeak

But df.info() gives the following output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
Inconsistent columns:
If your DataFrame contains columns that are irrelevant or that you will never use, you can drop them to focus on the columns you will actually work with. Let's see an example of how to deal with such a dataset.
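A minimal sketch of dropping unused columns (the column names 'stn_code' and 'sampling_date' are only illustrative assumptions, not taken from the text):

# drop the columns we will never use; errors='ignore' skips names that are absent
df = df.drop(columns=['stn_code', 'sampling_date'], errors='ignore')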
Changing the types to a uniform format: When you look at the dataset, you may notice that the 'type' column has values such as 'Industrial Area' and 'Industrial Areas'; both actually mean the same thing, so let's remove such inconsistencies and make the values uniform.
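A minimal sketch using pandas replace, covering only the two values mentioned above:

# map near-duplicate labels onto one canonical value
df['type'] = df['type'].replace({'Industrial Areas': 'Industrial Area'})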
Missing data:
1. Leave it as it is
2. Fill in the missing values
3. Drop them
For filling in the missing values we can use different methods. For example, Figure 4 shows that the airquality dataset has missing values. Columns such as SO2, NO2, rspm, spm, and pm2_5 are the ones that contribute most to our analysis, so we need to remove nulls from those columns to avoid inaccuracy in the prediction. We can use SimpleImputer from sklearn.impute (the successor of the older, deprecated Imputer from sklearn.preprocessing) to fill the missing values in every column with the mean.
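A minimal sketch of mean imputation, assuming the column names listed above:

from sklearn.impute import SimpleImputer
import numpy as np

cols = ['SO2', 'NO2', 'rspm', 'spm', 'pm2_5']   # names as given in the text
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit on the selected columns and write the imputed values back
df[cols] = imputer.fit_transform(df[cols])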
Outliers:
import matplotlib.pyplot as plt
import seaborn as sb

# box plot of cholesterol grouped by the target class
bxplt = sb.boxplot(x='target', y='chol', data=df)
plt.show()

sb.boxplot(x='target', y='oldpeak', data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f95fceb7e10>
Detect outliers in the continuous features:

outliers(df[continous_features])

Drop outliers:

outliers(df[continous_features], drop=True)

Outliers from age feature removed
Outliers from trestbps feature removed
Outliers from chol feature removed
Outliers from thalach feature removed
Outliers from oldpeak feature removed
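The outliers() helper called above is user-defined and its definition is truncated in the original; here is a minimal sketch of what such a helper might look like, assuming the usual 1.5 * IQR rule and that it removes rows from the enclosing df (both assumptions, inferred from the output above):

def outliers(data, drop=False):
    # data: DataFrame of continuous features sharing df's index
    for col in data.columns:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        # flag values outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
        mask = (data[col] < q1 - 1.5 * iqr) | (data[col] > q3 + 1.5 * iqr)
        if drop:
            # remove the offending rows from the global df (assumed behavior)
            df.drop(data.index[mask], inplace=True, errors='ignore')
            print("Outliers from {} feature removed".format(col))
        else:
            print("{} outliers found in feature {}".format(int(mask.sum()), col))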
Duplicate rows:
Datasets may contain duplicate entries, and deleting duplicate rows is one of the easiest cleaning tasks: you can use dataset_name.drop_duplicates().
# count the duplicate rows
duplicated = df.duplicated().sum()
if duplicated:
    print("Duplicated rows :{}".format(duplicated))
else:
    print("No duplicates")

Duplicated rows :1

# inspect all copies of the duplicated rows
duplicates = df[df.duplicated(keep=False)]
duplicates.head()
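A one-line sketch of actually removing them (keep='first', the pandas default, retains the first occurrence):

df = df.drop_duplicates(keep='first')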
2. Data integration
So far, we've made sure to remove the impurities in data and make it clean.
Now, the next step is to combine data from different sources to get a unified
structure with more meaningful and valuable information. This is mostly
used if the data is segregated into different sources. To make it simple, let's
assume we have data in CSV format in different places, all talking about the
same scenario.
Say we have some data about an employee in a database. We can't expect all
the data about the employee to reside in the same table. It's possible that the
employee's personal data will be located in one table, the employee's project
history will be in a second table, the employee's time-in and time-out details
will be in another table, and so on. So, if we want to do some analysis about
the employee, we need to get all the employee data in one common place.
This process of bringing data together in one place is called data integration.
To do data integration, we can merge multiple pandas DataFrames using the
merge function.
In this exercise, we'll merge the details of students from two datasets,
namely student.csv and marks.csv. The student dataset contains columns
such as Age, Gender, Grade, and Employed. The marks.csv dataset
contains columns such as Mark and City. The Student_id column is
common between the two datasets. Follow these steps to complete this
exercise:
dataset1 = "https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/student.csv"
dataset2 = "https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/mark.csv"
# note: to read these directly with pandas, use the raw-file URL
# (raw.githubusercontent.com) rather than the GitHub blob page
df1 = pd.read_csv(dataset1, header=0)
df2 = pd.read_csv(dataset2, header=0)
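The merge itself uses the common Student_id column, as described above; a minimal sketch with pandas merge:

# inner join of the two student tables on their shared key
df = pd.merge(df1, df2, on='Student_id')
df.head()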
3. Data transformation
Previously, we saw how we can combine data from different sources into a
unified dataframe. Now, we have a lot of columns that have different types
of data. Our goal is to transform the data into a machine-learning-digestible
format. All machine learning algorithms are based on mathematics. So, we
need to convert all the columns into numerical format. Before that, let's see
all the different types of data we have.
Taking a broader perspective, data is classified into numerical and categorical data. Numerical data is further divided into discrete and continuous data. Categorical columns, on the other hand, have to be converted into numbers; two common techniques, illustrated below with the 'type' column, are:
● Encoding
● Replacing
df['type'].value_counts()

RRO      179013
I        148069
RO        86791
S         15010
RIRUO      1304
R           158
Name: type, dtype: int64

After the categories are replaced with numeric codes, the tail of df['type'] looks like this:

435736    5.0
435737    5.0
435738    5.0
Name: type, Length: 435735, dtype: float64
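A minimal sketch of the replacement step; the mapping shown here is only a hypothetical one inferred from the code values above, since the exact mapping is not given in the original:

# hypothetical mapping from abbreviated categories to numeric codes
df['type'] = df['type'].replace({'RRO': 1.0, 'I': 2.0, 'RO': 3.0,
                                 'S': 4.0, 'RIRUO': 5.0})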
df['state'].value_counts()

Maharashtra             60382
Uttar Pradesh           42816
Andhra Pradesh          26368
Punjab                  25634
Rajasthan               25589
Kerala                  24728
Himachal Pradesh        22896
West Bengal             22463
Gujarat                 21279
Tamil Nadu              20597
Madhya Pradesh          19920
Assam                   19361
Odisha                  19278
Karnataka               17118
Delhi                    8551
Chandigarh               8520
Chhattisgarh             7831
Goa                      6206
Jharkhand                5968
Mizoram                  5338
Telangana                3978
Meghalaya                3853
Puducherry               3785
Haryana                  3420
Nagaland                 2463
Bihar                    2275
Uttarakhand              1961
Jammu & Kashmir          1289
Daman & Diu               782
Dadra & Nagar Haveli      634
Uttaranchal               285
Arunachal Pradesh          90
Manipur                    76
Sikkim                      1
Name: state, dtype: int64
# after label encoding, state code 0 corresponds to Andhra Pradesh
dfAndhra = df[(df['state'] == 0)]
dfAndhra['location'].value_counts()
4. Error correction
In the heart dataset, the ca column (number of major vessels) should only take the values 0 to 3, so a value of 4 is an error. We mark it as missing and then fill every NaN with the column median:

df['ca'].unique()                        # reveals the invalid value 4
df[df['ca'] == 4]                        # inspect the affected rows
df.loc[df['ca'] == 4, 'ca'] = np.nan     # mark the bad entries as missing
df.isna().sum()                          # count missing values per column
df = df.fillna(df.median())              # fill each NaN with its column median
df.isnull().sum()                        # verify nothing is missing anymore
5. Data model building
In general, we carry out the train-test split with an 80:20 ratio, as per the Pareto principle. The Pareto principle states that "for many events, roughly 80% of the effects come from 20% of the causes." But if you have a large dataset, it really doesn't matter whether the split is 80:20, 90:10, or 60:40. (It can be better to use a smaller training set if the process is computationally intensive, but this may cause the problem of overfitting; this will be covered later in the book.)
from sklearn import preprocessing

# encode every column to integer labels so the model can digest them
df = df.apply(preprocessing.LabelEncoder().fit_transform)
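Before printing the shapes below, the data has to be split; a minimal sketch, assuming 'target' is the label column and the 80:20 ratio discussed above:

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)   # features
y = df['target']                # label (assumed column name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)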
Print the shape of X_train, X_test, y_train, and y_test. Add the following code to do this:
print("X_tr
ain :
",X_train.s
hape)
print("X_te
st :
",X_test.sha
pe)
print("y_tr
ain :
",y_train.sh
ape)
print("y_tes
t:
",y_test.sha
pe)
The preceding code prints the shapes of the four resulting arrays.
Supervised Learning
Supervised learning is broadly split into two categories: classification and regression.
Time series analysis, as the name suggests, deals with data that is distributed with respect to time, that is, data in chronological order. Stock market prediction and customer churn prediction are two examples of problems involving time series data. Depending on the requirement, time series analysis can be posed as either a regression or a classification task.
Unsupervised Learning
Unsupervised learning works on unlabeled data: instead of predicting a known target, the algorithm discovers structure in the data on its own, as in clustering.
CONCLUSION:
Thus, we studied the concept of data preprocessing and applied its methods: data cleaning, data integration, data transformation, error correction, and data model building.