Tutorial 2 Questions
(Missing Data)
Question 5:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians
given medical details. Download the Pima-Indians-Diabetes data from
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input
variables and 1 output variable. The variable names are as follows:
0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).
(a) Print the summary statistics of this data set.
(b) Count the number of “0” entries in columns [1,2,3,4,5].
(c) Replace these “0” values with “NaN”.
(Hint: you might need “from pandas import read_csv” and the “.describe()” and
“.replace(0, numpy.nan)” functions. Note that the older alias “numpy.NaN” was removed in NumPy 2.0.)
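One possible way to approach parts (a) to (c) is sketched below. It is a minimal sketch, not a model answer. It assumes the CSV has no header row, so pandas labels the columns 0 to 8 in the order listed above. The helper names load_pima, count_zeros and zeros_to_nan are hypothetical, and the three-row sample is made up purely so that the sketch runs without a download.

```python
import numpy as np
import pandas as pd

URL = ("https://raw.githubusercontent.com/jbrownlee/Datasets/"
       "master/pima-indians-diabetes.data.csv")

# Columns in which a 0 is treated as a missing measurement (part (b)).
ZERO_AS_MISSING = [1, 2, 3, 4, 5]


def load_pima(source=URL):
    """Read the data set. Assumption: the file has no header row."""
    return pd.read_csv(source, header=None)


def count_zeros(df, cols=ZERO_AS_MISSING):
    """Part (b): number of entries equal to 0 in each selected column."""
    return (df[cols] == 0).sum()


def zeros_to_nan(df, cols=ZERO_AS_MISSING):
    """Part (c): return a copy with 0 replaced by NaN in the selected columns."""
    out = df.copy()
    out[cols] = out[cols].replace(0, np.nan)
    return out


# Made-up rows (NOT taken from the real data set), used only for illustration.
sample = pd.DataFrame([
    [2, 120, 70, 30, 0, 30.5, 0.5, 40, 1],
    [0, 0, 65, 0, 90, 25.0, 0.3, 28, 0],
    [5, 150, 0, 25, 0, 0.0, 0.7, 35, 1],
])

print(sample.describe())      # part (a): summary statistics
print(count_zeros(sample))    # part (b): zero counts per column
print(zeros_to_nan(sample))   # part (c): zeros shown as NaN
```

To run the same steps on the real data, replace sample with load_pima(). Column 0 (number of times pregnant) and column 8 (class) are left untouched because 0 is a valid value there.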
Question 6:
The Disease Outbreak Response System Condition (DORSCON) in Singapore is a colour-coded framework that shows the
current disease situation. The framework provides us with general guidelines on what needs to be done to prevent and
reduce the impact of infections. There are 4 statuses – Green, Yellow, Orange and Red, depending on the severity and
spread of the disease. Which type of data does DORSCON belong to?
(1) Categorical; (2) Ordinal; (3) Continuous; (4) Interval
(In Quiz and Exam format)
Question 7:
A boxplot is a standardized way of displaying the distribution of a dataset based on a five-number summary: the minimum, the
maximum, _BLANK1_, and the first and third quartiles, where the number of data points that fall between the first and
third quartiles amounts to _BLANK2_ percent of the total number of data points displayed.