SSRN-id3349586

2nd INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND SOFTWARE ENGINEERING (ICACSE-2019)
Impact of Preprocessing Methods on

Healthcare Predictions
Puneet Misra1 and Arun Singh Yadav2
Abstract—Machine learning (ML) is now a day gaining immense importance and is becoming a key technology as the rapid growth
of quality of medical data and information. But the early and accurate detection of disease is still a challenge due to the complex,
incomplete and multidimensional healthcare data. Data preprocessing is an essential step of ML whose primary goal is to provide
processed data to improve the prediction accuracy. This study summarizes the popular data preprocessing steps based on their usages,
popularity and literature. After that the selected preprocessing methods is applied on the raw data which is then used by classifiers for
predictions. In this experiment we have taken diabetes classification problem. Type II diabetes mellitus (T2DM) is a major disease
with high penetrance in human around the world and still rising. This may cause other serious complications like kidney failure, heart
failure, blindness etc. The early detection and diagnosis help to identify and may avoid these complications. Several classification
algorithms exist but selecting the best classifier surely improves the accuracy of the predictions. The preprocessing methods selected
in this study are Multiple Imputation, k-means for missing values treatment, Discretization to change in discrete values, Standard
scaler, Min-Max scalar for feature scaling and, Random Forest (RF) for feature selection. For the classification Logistic Regression
(LR), Artificial Neural Network (ANN), Support Vector Machine (SVM) and Random Forest (RF) are used. To evaluate the
performance of model accuracy, sensitivity, specificity are used. This study compares the model performance with and without
preprocessed data and has proved that the selected preprocessing methods significantly improves the model performance.
Keywords: Machine learning, Disease prediction, Classification, Preprocessing, Multiple Imputation, k-NN, Standard Scaler, Min-
Max scalar, RFE, LR, ANN, SVM, RF

I. INTRODUCTION The main objective of this paper to study the impact of

selected preprocessing techniques on prediction value and
Machine Learning (ML) is gaining importance day by day justify that data preprocessing using intelligent techniques
due to its capability work with heterogeneous data set. ML significantly improves the model performance. Another
algorithms directly learn from the data, produce the hidden objective is to study various preprocessing methods and
insights and can predict or forecast the future outcomes on select the best among them and this selection has been
the basis of its learning[1]. Predictions can be done either done on the basis of their usages, popularity and should
by classification or regression approach. The classification have been widely cited by research community.
accuracy of the predictions depends on the quality of the We have started with the study of the most influential data
data. Data generated from various sources may have preprocessing algorithms under various section of it and
missing values, noisy, inconsistent, voluminous and class select the widely used among them. In next section we
imbalanced[2]. This imperfect data requires data discussed about the dataset one public and another real-
preparation stage to clean and prepare the data[3] for world. The experimental setup and proposed framework is
further analysis. To get quality of data, machine learning explained in next section. We further described the model
provides one of the most meaningful steps called data evaluation. The detailed discussion on the result and
preprocessing. This step usually takes the significant performance of the model before and after applying data
amount of time[4] and should be implemented carefully to processing techniques is covered in next section and the
improve the overall model performance. last section covered the conclusion and future work.
The data preprocessing task includes certain steps like
data preparation, integration, cleaning, normalization, II. STUDY OF DATA PROCESSING ALGORITHMS
scaling and data reduction techniques to reduce
complexity, to noisy and irrelevant elements using feature A. Data Cleaning
selection and discretization etc. After this the outcome
The data taken from the real-world problem is seldom
expected a final dataset for further analysis using ML
algorithms.1 clean and complete, specially the healthcare field. The
data cleansing/cleaning step deals with the treatment of
missing values, errors and inconsistencies in the dataset.
1,2Department of Computer Science, University of To deal with this issue various methods or techniques are
Lucknow, Lucknow, India-226007 evolved since past. Choosing the appropriate technique
E-mail: 1puneetmisra@gmail.com, depends on various factors[5]. Proper handling of
2arun.ai.lkouniv@gmail.com imperfect data involved certain steps described below:
Electronic copy available at: https://ssrn.com/abstract=3349586

Impact of Preprocessing Methods on Healthcare Predictions 145
1) Missing Value Treatment Dropping the missing cases or attribute will never be the
best choice because it may contain some meaningful
Missing value (MV) is defined as the data value that is not
insights in data analysis. Hence some probabilistic method
present in the cell of the corresponding column. The
and imputation methods always get priority in MVs
reasons of MVs in healthcare context can be the human
treatments.
omission, not applicable instance, not recorded
electronically by the sensor, patient not present on the 3) Parameter Estimation Method
ventilator due to a medical decision, the patient condition
is unrelated for a particular variable, electricity failure, Traditionally statisticians used the probabilistic approach
database synchronization and others. Any statistical to estimate the MVs. In this the maximum likelihood
analysis or machine learning activity on the data having procedure is used to estimate the parameters of the model
MVs may produce the undesired or biased result and for complete dataset. To optimize the maximum-
improper handling of missing data can produce misleading likelihood procedure R.A. Fisher introduce a meta
conclusions[6]. According to the survey[6] problems algorithm Expectation-Maximization[8]. It estimates the
associated with the MVs can be loss of efficiency, parameters of a probability distribution which better fit the
complications in handling and analyzing data and biased observed data in the iterative processes[10]. But its
result. But the incomplete data is the subject to study due convergence rate is slow and faces issue to deal with non-
to its impact on the classification accuracy[7]. Before linear high-dimensional data. This method also
applying any treatment first, we need to understand the underestimates the standard errors at the time of
pattern of missingness. Researchers Little and Rubin estimation process.
categorize it into three ways[8]: Missing completely an Multiple Imputation (MI): Rubin (1976) first developed a
random(MCAR), Missing at random(MAR) and Not framework for incomplete data after some years in 1987
Missing at random(NMAR). The first one occurs when the he proposed a MI method. The exhaustive discussion on
missing data does not depend on any other attribute of the Multiple Imputation methods can be found in Schafer et
dataset, second occurs when the distribution of MVs of an al. research article[9] . It has a clear advantage over the
attribute depends on observed data of another attribute but parameter estimation method and single imputation
itself and the third one happens when the distribution of method. Here the missing terms is imputed more than
MVs of an attribute depends on itself. Handling of MVs once from the Bayesian posterior distribution, which
can be done either discarding it or imputing the maintains the variation in the dataset. It additionally
missing data. computes the variation-based error, sometimes called
‘between imputation error’. MI is very simple and can be
2) Discarding the Missing Values implemented in so many ways like MCMC (Markov
Chain Monte Carlo) and MICE (Multiple Imputation by
The most usual approach to discard the MVs but this Chained Equations) methods are widely used. The MCMC
approach is not so practical[9] because if the train data is a simulation-based method while MICE is multivariate
have a large number of missing values then the produced imputation method and gives better result on MAR data.
result must biased. If the dataset has a small number of MICE use series of regression models to compute the
missing values, then we must assure that analysis on the missing data where type of regression is depending on the
remaining part will not produce the inference bias. The type of data. If the missing variable distribution is
deletion of the MVs can be done in following ways: continuous then linear regression and if its distribution is
 Listwise deletion (Deleting rows): In this case the in binary form logistic regression is used[14].
complete case analysis is done and removes all the
observed cases having more than one missing 4) K–Nearest Neighbor Imputation (KNNI)
value. But this approach is helpful in small number
The traditional statistical models are working fine for
of missing cases. It works well when dataset has
the MCAR missing pattern and which is rare. MVs imputation with its limitations as the nature of data,
percentage of missing data etc. In the same series authors
 Pairwise deletion: This method attempts to G. Batista and M. Monard proposed machine learning
minimize the error in list wise deletion. In this the technique as an imputation method. Authors uses k-NN
attribute with MVs is deleted if it is not use as a algorithm as an imputation method with three other
case of another attribute while analyzing the data. imputation techniques mean, mode substitution and C4.5,
It strengthens the analysis power but create other CN2[11] on the three different healthcare problem. All the
complications like producing standard error. four methods were analyzed with different percentage of
 Dropping attribute completely: This is very rare missing data of three datasets. This research states that all
and, in my opinion, sometimes you can drop the the imputation techniques work well but k-NN algorithm
attribute completely if the missing observations are as 10-NNI outperform on the breast cancer problem. k-
more than 60% and the attribute looks insignificant NN[15] is most widely used clustering algorithm[16]
in the analysis. Sometime attributes with missing where k-neighbors are chosen on the basis of distance
values should be kept due to high relevance. measure(Euclidean, Hamming, Manhattan, etc.). It selects

146 2nd International Conference on Advanced Computing and Software Engineering (ICACSE-2019)
the similar features values and creates the distance matrix understand and interpret and also reduces model
and on the basis of that it creates the clusters of similar complexity. One of the major drawbacks with this
values. It can work on both discrete (Hamming distance) technique is the loss of information. FE is another data
and continuous data (Euclidean distance). There is no reduction technique which generates the new and stronger
specific criterion to select the best k-value, but elbow rule features that has the strong impact on the classifier result.
helps here to decide best k-value. From the machine Principal Component Analysis(PCA)[23] is one of the
learning point of view k-NN as an imputation is an most widely used linear method in this category in context
intuitive solution[17]. Unlike the other discussed of ML where Factor Analysis(FA)[24] is second on the
imputation method this algorithm doesn’t need a model to number and widely used in statistical learning. Instead of
predict the MVs hence it is more popular and widely used. dropping weak predictors PCA generate new predictors
uncorrelated to each other. But generally PCA outperform
B. Feature Scaling if dataset contains independent but uncorrelated predictors
Feature scaling have significant impact on some and one more issue with this is the selection of no. of
algorithms[18] (PCA, KNN, SVM and NN, if use gradient principal components[25]. Feature selection(FS) technique
descent optimization) while minimal on others. The does not make any change in the original feature which
features of the dataset may vary in terms of magnitude, helps in the interpretation and understanding[26]. Feature
range and unit. Most of ML algorithms work on the subset selection is the process of identifying and removing
magnitude of the measurement not on the unit. ML as much irrelevant and redundant information as
algorithms uses Euclidean distance between data points so possible[27]. The detail discussion on FS technique, its
the distance measure of a higher magnitude feature and importance, benefits can be seen in this article[28]. It is
low magnitude feature would produce undesired results. one of the most researched step under data preprocessing
Hence it is the necessity to scale all the features at same since past[29][30]. Feature subset selection can be done
level. Data scaling can be done either by Z-score either by filter or wrapper approach. Wrapper method use
algorithm or Min-Max algorithm. The z-score algorithm ML to decide the best feature by train it while filter
(standardization) scaled the features centered near to zero method don’t use the ML. Filter method is fast, but it is
i.e. mean of the distribution µ=0 and standard deviation not the best choice in case of small dataset while wrapper
method is slow, computationally costly but works well
σ=1. Here the original value is replaced by ̀ = . Min-
massive or small dataset. In this study Random Forest
Max scaling is another approach which scaled the feature (RF) method is used to select the best subset of
to a fixed range either [-1, 1] or [0, 1]. This takes features[31]. RF ranks the features using ensemble of
minimum and maximum sample as [ & ] and decision tree and other tree-based approaches. Each node
replaced the original features to scaled features as ̀ = of the decision tree signifies a condition on single feature,
. and it divides it into two parts, so the similar response
values remains in the same set. It selects the optimal
C. Data Reduction feature on measure of impurity. In each training step it
Data is generating in a rapid pace through various checks the how much each feature decreases the weighted
impurity of decision tree.
electronic devices which is very large and high
dimensional in nature, especially medical data (patient III. DATASET DESCRIPTION
information, lab test, symptoms, device generated
etc.)[10]. Handling multidimensional is a tedious task and In this study we have taken two different diabetes type-2
prediction with all the features sometimes may not be dataset as a classification problem and applied selected
useful because the presence of irrelevant features. Hence preprocessing techniques to check the impact of data
selecting the best features or generating the new features preprocessing on the classifier predictions accuracy. The
from existing can improve the classifier accuracy. This first dataset PIMA Indian Diabetes Dataset (PIDDs)[32] is
introduced the data reduction task to select most relevant taken from the Machine Learning Database repository of
predictors either by feature selection(FS), extraction(FE) University of California, Irvine (UCI)[33] and the second
and discretization methods[10]. [19] Study shows that one LUDB2 from the Bioinformatics Research Lab,
more than 75% of data reduction task is concerned feature Lucknow University. Some previous studies show that
selection, about 15% to feature extraction and less than Pima Indians may be genetically predisposed to type-2
10% concerned discretization method. Discretization
diabetes which was 19 times of the any nearby typical
converts the quantitative data into qualitative data with
certain number of intervals. Discretization is getting town. The PIDDs (table1) contain total 9 features where 8
importance in research during recent years[20]. It becomes features are used as predictor variables and last one class
the necessity where algorithm works well on nominal data variable indicating the onset of diabetes within 5 years.
like Decision tree[21], Naïve Bayes[22]. Discretization The dataset contains total 768 female patients of age 21
simplifies the data and makes learning faster and accurate years and above where 268 instances are diabetic and 500
where one more advantage is discrete variables are easy to instances are non-diabetic.

The LUDB2 dataset (table 2) contain 12 features where 11 studies have shown that the prediction accuracy depends
features are used as independent variables, means the on the quality of data. The quality of data can be check
predictors and last feature is the dependent variable or either exploring the datasets graphically or analytically.
class variable which indicates the diabetes status (yes/ no). So, in second part we have done exploratory data analysis
The dataset contains 1000 instances of patient’s data to identify the MVs, noisiness in the given datasets and
where 500 patients are non-diabetic and rest 500 are then applied imputation technique k-NN for MVs
diabetic. In this dataset family history is taken as a one treatment and store it in two different variables. After
categorical variable to represent the heredity in the family. dealing with MVs the scaling of features is done by
Table 1: Pidds type 2 diabetes dataset
MinMax Scalar because features vary in magnitude and
nature (categorical and non-categorical). Selection of best
S.No. Columns Variables Description features is the next hurdle, so we have used RF as a
1 AGE Age (years) Age of the female patients feature selection. This preprocessed data then applied to
[21 to 81] different classifier to predict the diabetic and non-diabetic
patient, see table V & VI for the results. All the
1. PREG Pregnancy No. of times a woman gets experimental work has been done on Jupyter Notebook v.
pregnant [0 to 17]
IDE. The Python 3.0 programming language is used to for
2. BMI Body mass Body Mass Index (Weight
index in kg/height in m2) [0 to the analysis and for building prediction models. We have
67.1] used numpy, pandas, matplotlib libraries of python 3.0 for
3. PGLU Plasma Plasma Glucose dataset representation, processing and visualization and
Glucose concentration measured scikit-learn for building the machine learning models for
concentration using a 2-hour oral glucose
tolerance test [ 0 to 199] predictions.
4. DBP Diastolic Diastolic blood pressure
blood (mm Hg) [0 to 122] V. EVALUATION MATRIX
pressure
INSU Insulin 2-hour serum insulin (mu
To evaluate the model performance Accuracy, sensitivity,
5.
U/ml) [0 to 846] specificity are used. Accuracy measure is always preferred
6. TFST Triceps skin Triceps skin fold choice in classification problem if target variable in
fold thickness(mm) [ 0 to 99 ] dataset is approximately balanced. In our case LUDB2
thickness
dataset have 50% diabetic and rest 50% are non-diabetic
7. DPF Diabetes Diabetes pedigree function but PIDDs is imbalanced and contains 70% positive and
pedigree [ 0.08 to 2.42 ]
function
30% negative cases. To calculate performance measures
8. CLASS Diabetes Patient with diabetes onset Accuracy, sensitivity and specificity certain matrix is
status (0 or within five years needed i.e. True Positive (TF), True Negative (TN), False
1) Positive (FP) and False Negative (FN). The formulation of
Table 2: LUDB2 dataset Type-2 Diabetes Dataset with variable these matrices is given in table below:
description
Table 3: Model evaluation metrices
S.No. Columns Variables Description
Matric Formula
1 GNDR Gender Male and Female
2 FHIST Family History anyone in family have Accuracy (TP+TN)/(TP+TN+FP+FN)
db2 positive or not
Sensitivity TP/(FN+TP)
3 AGE Age (Years) 20 to 70 years
4 BMI BMI (kg/m2) Body mass index Specificity FP/(FP+TN)
5 WHRATIO W/H ratio Waist-Height Ratio
VI. RESULT AND DISCUSSION
6 SBP SBP (mmhg) Systolic Blood Pressure
7 DBP DBP (mmhg) diastolic blood pressure The PIDDs dataset, when used without preprocessing as
8 FPG FPG (mg/dl) Fasting Plasma Glucose an input to the classifiers LR (Logistic Regression), ANN
9 PPG PPG (mg/dl) Postprandial Blood (Artificial Neural Network), SVM (support vector
Glucose
10 HBA1C Glucose HbA1c hmoglobin A1C (HbA1c) machine) and Random Forest, models exhibit the
(%) accuracies 75%, 67%, 65% and 74%. Here, it can be
12 CLASS Diabetic Status Positive or Negative observed that Logistic regression model gives the best
(Months) response without applying any preprocessing methods
while Random forest is on second top performer. But
IV. EXPERIMENTAL SETUP
ANN and SVM is not performing well. The reason why
The whole experiment is divided into two parts: they are giving the worst result can be missing values and
prediction of classifier accuracy with and without noise and both models requires scaled values. But when
preprocessing techniques. In first part we have used the we applied same methodology on LUDB2 dataset, all the
dataset directly to train and test the Neural Network classifiers exhibit good accuracies 98%, 94%, 74% and
classifier and store the result in table 3. As previous 100%. So here Random Forest is outperforming and can

classify positive and negative cases with 100% accuracy. Classification Accuracies of
The reason of this can be LUDB2 dataset have no missing PIDDs dataset
values, no noise and balanced. But after applying the
preprocessing methods on both the dataset and when 0.77 0.8 0.790.78
1 0.750.670.650.74
applied on classifiers, results got improved. We start the
experiment by investigating the missing values present in 0.5
the dataset and it is found that LUDB2 dataset have no 0
missing data values. But FIHST attribute have some LR ANNSVM RF LR ANNSVM RF
missing data as NaN which means that no one is diabetic Without With
in his family, hence we simply replace it with value zero Preprocessing Preprocessing
(0). But when we explored PIDDs dataset graphically and
analytically, it contains 4% to 48% missing or incorrect Figure 1: Comparison of classification accuracies before and after
values (refer to table IV). applying preprocessing methods in PIDDs dataset
Table 4: Incorrect or missing value instances present in different
attributes of PIDDs dataset out of 786 rows
S.No. No. of Missing or

Attribute incorrect values
1. PREG 5
2. BMI 35
3. PGLU 227
4. DBP 374 Figure 2: Feature importance using random forest in PIDDs dataset
5. INSU 11 DPF (Diabetes Pedigree function) is getting good
The above table shows that DBP (Diastolic Blood importance but showing very less correlation with class
Pressure) contains 48% incorrect or missing values (only attribute. Hence, we have not taken these three (DBP, DPF
those instances that have value 0) and PGLU (Plasma and TFST) attributes for further analysis. So, we have
Glucose concentration) have 29% missing instances. done the next analysis by using PGLU, AGE, BMI, INSU
Theoretically 25% to 30% missing values are accepted, and PREG. The same process when we have done with
but we cannot delete them because we don’t know how LUDB2 dataset we have found that HbA1C, FPG, PPG,
much they impacts the predictions. We replaced missing W/H ratio, BMI, SBP and AGE are the top rankers
instances with kNN method but in both the datasets have (see fig. 2).
extreme values that may impact the learning process of the
model, but the treatment of these outliers is out of the Classification Accuracies of
scope of this study, so we as it is keep them in dataset. In LUDB2 dataset
LUDB2 dataset GNDR & FHIST attributes are of object
1.5 0.98 0.94 0.74 1 0.980.96 1 1
type so we convert them into integer type but when we 1
applied this on simple logistic regression model, we found 0.5
that variability is not impacting the accuracy, so we 0
discretize it into 0 & 1. The same case is done with PIDDs LR ANNSVM RF LR ANNSVM RF
dataset the attribute GNDR to 0 & 1. The same case we Without With
have found with PREG attribute in PIDDs dataset, so Preprocessing Preprocessing
discretize it into 0 & 1.
We have used distance-based classification algorithms Figure 3: Comparison of classification accuracies before and after
that’s why we need to scale the feature for better accuracy. applying preprocessing methods in LUDB2 dataset
So, for this we have used Min-Max Scaler and Standard
Scaler method in both the datasets PIDDs and LUDB2.
For feature selection first we checked the feature
importance and after that we selected the best among
them. After applying Random Forest algorithm on PIDDs
to calculate feature importance, we have found that
features PGLU (Plasma Glucose concentration) is getting
immense importance, INSU(Insulin) is on second place,
BMI on third and AGE scored fourth place. But BMI is
highly correlated with TFST (skin thickness). DBP had
the highest missing values (table 3) and has lowest ranking
in feature estimators. Figure 4: Ranking of features using random Forest in LUDB2 dataset

After applying all the preprocessing methods on both the have used two different datasets of healthcare sector for
datasets a significant amount of increase can be seen by all predicting the type 2 diabetes onset. Firstly, we started
the classifiers in the accuracies in PIDDs dataset (refer to with the detailed study of different preprocessing methods,
table V). While LUDB2 dataset is showing the small its impacts on the predictions, and its usability with
increase in ANN 94% to 96% but LR and Random Forest limitations in different scenarios. We have found that
classifiers has no change in accuracy, but sensitivity and traditional missing data imputation methods were
specificity interchangeably changes in LR. But SVM inefficient with large set of missing values in a specific
achieved the highest change in accuracy from 74% attribute while machine learning based kNN methods has
(without preprocessing) to 100% (with preprocessing) worked well in this scenario. One more conclusion has
(refer to table VI). seen that different distance-based classifiers worked
The ANN models exhibit the highest change in accuracy differently with Standard Scaler and Min Max Scaler.
in PIDDs dataset from 67% (without preprocessing) to Study also shows that feature selection is the preferred
80% (with preprocessing) where SVM is second on the list method in data reduction and tree-based methods are
and achieved 79% of accuracy. We can draw one more widely used with small dataset. The experimental study on
observation from here that LR is the less complex model, public and private dataset have some interesting
but it is also able to classify the diabetic and non-diabetic conclusions. The preprocessed PIDDs dataset shows an
patients with 77% (PIDDs) and 98% (LUDB2) after average accuracy of 80% with ANN classifier, as against
applying the preprocessed dataset, which is notable. the accuracy of without preprocessed dataset which is
The comparative analysis of the accuracies achieved by 67%. The improved accuracy is very near 79% with SVM
various classifiers on two different datasets can be seen in on tested modified dataset. Hence, ANN is the top
the fig. 3 and fig. 4 while the detailed performance performer on PIDDs dataset.
analysis can be seen in table V and table VI. When the same experiment has done with real private
Table 5: Outcome of classifiers before and after applying pre-processing LUDB2 dataset, which has no missing values, and after
techniques on PIDDs datasets preprocessing, all the classifiers are showing the similar
accuracies with small increment. Almost all the classifiers
Preprocessing Classifie Accurac Sensitivit Specificit achieved more than 90% accuracy expect SVM on
Techniques r y y y
original LUDB2 dataset. But in feature selection it is
No Preprocessing LR 0.75 0.57 0.86 found that only Hb1AC and FPG attributes can classify
With Preprocessing LR 0.77 0.56 0.88 the patient accurately, but we have taken more for the
caution. Hence the preprocessed LUDB2 dataset improved
No Preprocessing ANN 0.67 0.25 0.89 the accuracies a little but SVM shows the drastic increase
With Preprocessing ANN 0.80 0.65 0.88 from 74% to 99%.
No Preprocessing SVM 0.65 0 1 This proposed approach is tested on similar disease health
With Preprocessing SVM 0.79 0.82 0.78 data with some common attributes and has shown that
No Preprocessing Random 0.74 0.48 0.89 selected preprocessing methods surely improves the
Forest predictions. Some more studies and experiment can do
With Preprocessing Random 0.78 0.67 0.84 with other healthcare prediction problems and can be used
Forest on any prediction model by applying the same approach.
Table 6: Outcome of classifiers before and after applying pre-processing
techniques on LUDB2 datasets REFERENCES
Preprocessing Classifie Accurac Sensitivit Specificit [1] S. Ben-David and S. Shalev-Shwartz, Understanding Machine
Techniques r y y y Learning: From Theory to Algorithms. 2014.
No Preprocessing LR 0.98 0.96 1.0 [2] S. Batra and S. Sachdeva, “Organizing standardized electronic
With Preprocessing LR 0.98 1.0 0.96 healthcare records data for mining,” Heal. Policy Technol., 2016.
[3] R. Duggal, S. Shukla, S. Chandra, B. Shukla, and S.K. Khatri,
No Preprocessing ANN 0.94 0.88 1.0 “Impact of selected pre-processing techniques on prediction of risk
With Preprocessing ANN 0.96 0.92 1.0 of early readmission for diabetic patients in India,” Int. J. Diabetes
Dev. Ctries., vol. 36, no. 4, pp. 469–476, 2016.
No Preprocessing SVM 0.74 1.0 0.48 [4] wp185007, Data Preparation for Data Mining.doc. 2006.
With Preprocessing SVM 1.0 1.0 1.0 [5] F. Cismondi, A.S. Fialho, S.M. Vieira, S.R. Reti, J.M. C. Sousa,
and S.N. Finkelstein, “Missing data in medical databases: Impute,
No Preprocessing Random 1.0 1.0 1.0 delete or classify?,” Artif. Intell. Med., 2013.
Forest [6] H. Wang and S. Wang, “Mining incomplete survey data through
With Preprocessing Random 1.0 1.0 1.0 classification,” Knowl. Inf. Syst., Vol. 24, No. 2, pp. 221–233,
Forest 2010.
[7] I.A. Gheyas and L.S. Smith, “A neural network-based framework
VII. CONCLUSION for the reconstruction of incomplete data sets,” Neurocomputing,
vol. 73, no. 16–18, pp. 3039–3065, 2010.
This work proposes the impact and importance of selected [8] D.B.R. Roderick J.A. Little, “Statistical Analysis with Missing
preprocessing methods on the healthcare predictions. We Data,” WILEY Ser. Probab. Stat., p. 96, 2010.

[9] J.L. Schafer and J.W. Graham, “Missing data: Our view of the state [22] Y. Yang and G.I. Webb, “Discretization for naive-Bayes learning:
of the art,” Psychol. Methods, vol. 7, no. 2, pp. 147–177, 2002. Managing discretization bias and variance,” Mach. Learn., Vol. 74,
[10] S. García, J. Luengo, and F. Herrera, “Tutorial on practical tips of No. 1, pp. 39–74, 2009.
the most influential data preprocessing algorithms in data mining,” [23] Share, P.C. Analysis, and G.H. Dunteman, “Principal Components
Knowledge-Based Syst., Vol. 98, pp. 1–29, 2016. Analysis,” in Principal Components Analysis, SAGE Publications,
[11] G.E.A.P.A. Batista and M.C. Monard, “An analysis of four missing Inc, 1989, p. 96.
data treatment methods for supervised learning,” Appl. Artif. [24] C.W.M. Jae-On Kim, Factor Analysis: Statistical Methods and
Intell., Vol. 17, No. 5–6, pp. 519–533, 2003. Practical Issues, 14th ed. SAGE Publications, Inc, 1978.
[12] J. Luengo, S. García, and F. Herrera, On the choice of the best [25] P.R. Peres-Neto, D.A. Jackson, and K.M. Somers, “How many
imputation methods for missing values considering three groups of principal components? stopping rules for determining the number
classification methods, Vol. 32, No. 1. 2012. of non-trivial axes revisited,” Comput. Stat. Data Anal., Vol. 49,
No. 4, pp. 974–997, 2005.
[13] J. Alcalá-Fdez et al., “KEEL: A software tool to assess
evolutionary algorithms for data mining problems,” Soft Comput., [26] N. Poolsawad, L. Moore, C. Kambhampati, and J.G.F. Cleland,
“Issues in the Mining of Heart Failure Datasets,” Int. J. Autom.
Vol. 13, no. 3, pp. 307–318, 2009.
Comput., vol. 11, no. 2, pp. 162–179, Apr. 2014.
[14] K. Baclawski, “Multiple Imputation by Chained Equations,” vol.
30, no. April, pp. 1–3, 2011. [27] thesis, “Correlation-based Feature Selection forMachine Learning,”
no. April, 1999.
[15] N.S. Altman, “An Introduction to Kernel and Nearest-Neighbor
[28] I. Guyon, “An Introduction to Variable and Feature Selection
Nonparametric Regression,” Am. Stat., Vol. 46, No. 3, pp. 175–
185, 1992. Isabelle,” J. ofMachine Learn. Res. 3 1157-1182, Vol. 3, pp. 1157–
1182, 2003.
[16] A.K. Jain, “Data clustering: 50 years beyond K-means,” Pattern
[29] A.L. Blum and P. Langley, “Selection of relevant features and
Recognit. Lett., Vol. 31, No. 8, pp. 651–666, 2010.
examples in machine learning,” Amficial Intell., Vol. 17, no. 1–2,
[17] L. Beretta and A. Santaniello, “Nearest neighbor imputation
pp. 245–271, 1997.
algorithms: A critical evaluation,” BMC Med. Inform. Decis. Mak.,
[30] R.K.A and L. George H. John b, “Wrappers for feature subset
Vol. 16, Suppl 3, 2016.
selection,” Artif. Intell., Vol. 97, No. 1–2, pp. 273–324, 1997.
[18] X.H. Cao, I. Stojkovic, and Z. Obradovic, “A robust data scaling [31] A. Hapfelmeier and K. Ulm, “A new variable selection approach
algorithm to improve classification accuracies in biomedical data,” using Random Forests,” Comput. Stat. Data Anal., Vol. 60, No. 1,
BMC Bioinformatics, Vol. 17, No. 1, pp. 1–10, 2016. pp. 50–69, 2013.
[19] A. Idri, H. Benhar, J.L. Fernández-Alemán, and I. Kadi, “A [32] V. Sigillito, “Pima Indians Diabetes Database,” National
systematic map of medical data preprocessing in knowledge Institute of Diabetes and Digestive and Kidney Diseases, 1990.
discovery,” Comput. Methods Programs Biomed., Vol. 162, [Online]. Available: http://ftp.ics.uci.edu/pub/machine-learning-
pp. 69–85, 2018. databases/pima-indians-diabetes/pima-indians-diabetes.names.
[20] H. Liu, “Discretization: An Enabling Technique,” Data Min. [33] “PIMA INDIAN DIABETES DATASET,” UCI Machine Learning
Knowl. Discov., Vol. 6, pp. 393–423, 2002. Repository, 1988. [Online]. Available:
[21] J.R. Quinlan, “Improved Use of Continuous Attributes in C4.5,” J. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
Artif. Int. Res., Vol. 4, No. 1, pp. 77–90, Mar. 1996. [Accessed: 15-Apr-2018].

SSRN-id3349586

Uploaded by

Copyright:

Available Formats

SSRN-id3349586

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

SSRN-id3349586

Uploaded by

Copyright:

Available Formats

2nd INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND SOFTWARE ENGINEERING (ICACSE-2019)

Impact of Preprocessing Methods on

I. INTRODUCTION The main objective of this paper to study the impact of

Electronic copy available at: https://ssrn.com/abstract=3349586

Electronic copy available at: https://ssrn.com/abstract=3349586

Electronic copy available at: https://ssrn.com/abstract=3349586

Electronic copy available at: https://ssrn.com/abstract=3349586

S.No. No. of Missing or

Electronic copy available at: https://ssrn.com/abstract=3349586

Electronic copy available at: https://ssrn.com/abstract=3349586

Electronic copy available at: https://ssrn.com/abstract=3349586

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.