ML For Health
ML For Health
ML For Health
COMP90089
Machine Learning Applications for Health
Team 8
1
Abstract
The study predicts the mortality risk of ICU patients with ventilator-associated
pneumonia (VAP) from the MIMIC-IV database. Early indicators for predicting
patient survival were identified using appropriate cohort selection and data pre-
processing methods. The study considered early vital signs and laboratory mea-
surements collected across all ICU stays, along with the patient’s demographic
information. It comprehensively describes the timeline considered for each of the
indicators. The aim is to provide a practical, stepwise approach to mortality pre-
diction and provide clusters to identify distinct patterns among VAP patients.
The results show how early predictors are crucial in mortality prediction amongst
VAP patients. It will empower healthcare professionals to manage and mitigate
mortality risks associated with VAP in the ICU proactively, ultimately improving
patient care and outcomes.
1 Introduction
Patients in the intensive care unit (ICU) are at risk of both critical illness-related
mortality and comorbidities, such as hospital-acquired infections. The second most
common of these is ventilator-associated pneumonia (VAP). About 25% of patients
in the ICU encountered VAP-related difficulties during the COVID-19 pandemic.
(Russo et al., 2022).
According to surveys, there are 250,000 to 300,000 cases in the United States each
year, with 5 to 10 cases per 1,000 hospital admissions (Koenig and Truwit, 2006).
VAP patients have an attributed mortality rate between 0 and 50% (Koenig and
Truwit, 2006), indicating that VAP is a disease with high prevalence and high im-
pact.
In this study, we extract, clean, and feature data from the MIMIC database to fore-
cast the likelihood of mortality in VAP-identified patients. We selected early pre-
dictors for survival prediction based on a review of diverse research papers (Zhang
et al., 2022). They not only match with clinical expertise but also capture complex
insights from the VAP research.
We will exploit the power of both Supervised and Unsupervised Machine Learning
(ML) approaches. Supervised ML predicts key mortality risk factors in VAP pa-
tients. We’ll compare results from logistic regression, SVM, XGBoost, and Random
Forest. Unsupervised ML performs clustering tasks to discern distinctions among
the identified salient features.
The study empowers healthcare workers to proactively use early indicators to miti-
gate VAP-related mortality risk.
2 Methods
This section provides an overview of the methodology used for the digital phenotyp-
ing of VAP patients. It describes data preprocessing, feature selection, ML method-
ologies and model evaluation techniques. Finally, we discuss the ethical aspects of
3
the dataset.
• Ventilation Duration: This applies to ICU patients who have been on me-
chanical ventilation for more than 48 hours.
4
For clinical indicators and microbiology events, we considered post-48-hour ventila-
tion session observations until the ICU end time (Figure 1).
A patient is diagnosed with VAP if they fall into either of the two categories.
Category 1:
ICU Patients on ventilation for over 48 hours with specified clinical indicators.
Category 2:
ICU Patients on ventilation for over 48 hours with significant microbiology events.
5
Figure 2: Cohort Identification Flowchart
• During Ventilation:
◦ Check whether the patient received a Bronchoscopy procedure in their
last ICU stay.
6
• Post Ventilation:
◦ Average BMI, WBC count, and body temperature from post-ventilation
to the end of the ICU stay.
Along with the above features, we have also considered the following:
A detailed timeline of the features is presented in the following diagram (Figure 3).
7
Figure 4: Reduced Cohort
8
model (Table 2).
We have used the Recursive Feature Elimination (RFE) algorithm to get the top
15 scaled features identified by the logistic regression, SVM, Random Forest, and
XGBoost estimators.
9
Figure 6: Feature Significance
10
Figure 7: Machine Learning Approaches
To address this, we focus on the F1-score, AUC-ROC, and MCC. These metrics are
more appropriate for evaluating the model’s ability to accurately predict the ‘death’
category while considering the class imbalance, which is crucial in this medical con-
text.
11
Figure 8: Class Distribution of Patient Outcomes
2 · TP
F 1 score = (1)
2 · TP + FP + FN
MCC considers true and false positives and negatives, providing a reliable measure
of model performance that’s unaffected by class size differences.
(T P × T N ) − (F P × F N )
M CC = p (2)
(T P + F P )(T P + F N )(T N + F P )(T N + F N )
12
3 Results
The results section comprises three key segments. We start with Exploratory Data
Analysis to understand the cohort. Next, we evaluate various classification models
and perform feature importance analysis. Finally, we present clustering outcomes
to identify patient subgroups. Data is divided into 80% training and 20% testing
sets to assess model performance.
Mortality Information %
Males Who Died 32.78 %
Females Who Died 36.12 %
13
3.1.2 Bronchoscopy’s Impact
Bronchoscopy had a promising 67% survival rate, with survivors averaging 58.8
years, notably younger than non-survivors, who had an average age of 64.4 years.
The AUC value of 0.71 indicates that the SVM model has reasonably good discrim-
inative power in distinguishing between survival and mortality outcomes (Figure
9). An F1 score of 0.56 and Sensitivity of 0.70 indicate the model’s strength in
identifying mortality cases.
14
Figure 9: ROC-AUC Comparison of Different Models
15
3.3.1 Early Indicators
Elevated Anion Gap and Alkaline Phosphatase levels significantly increase VAP
mortality risk, while Creatinine, Eosinophils, Basophils, and Free Calcium levels
have a less pronounced impact. Lower Chloride levels also modestly contribute to
increased VAP mortality risk.
16
Table 6: Cluster Analysis summary
The key observations are listed in the (Table 6). It depicts how the early indicators
results affect the mortality and how the bronchoscopy procedure might reduce the
mortality of the VAP patients (Figure 12). Box plots are provided for further com-
parisons for all feature categories (Figure 13 ,14 ,15).
17
Figure 14: Box Plots for General Information grouped by Clusters
Using classification models for VAP patients, we explored the influence of early in-
dicators on mortality risk and identified the most critical early warning signs. In
particular, we found that Anion Gap, Chloride, and Creatinine had a high impact
on patient outcomes.
When the anion gap increases, there is likely to be an acid-base imbalance that is
associated with VAP. High Creatinine levels can suggest kidney injury associated
18
with VAP, while high chloride levels can indicate fluid status. The relevance of these
markers in VAP management has been demonstrated in numerous studies investi-
gating their diagnostic and prognostic capabilities (Dafal et al., 2021; Pfortmueller
et al., 2018; Murugan et al., 2010). Critical care can be significantly improved by
applying this knowledge, potentially reducing fatality rates.
There are, however, challenges and limitations associated with the development of
machine learning models for predicting VAP patient mortality. A limitation of the
MIMIC dataset is that it lacks data that specify the exact time at which the patient
contracted VAP. Additionally, the specific nature of the MIMIC dataset may restrict
our ability to provide a universally applicable solution.
Furthermore, our study relied solely on one source of data, which may limit the
generalizability of our findings. Due to the retrospective nature of the study, it is
difficult to assess how this algorithm might impact patient outcomes in a real clinical
setting. Future research needs to be prospectively validated to guarantee that the
findings can be applied to clinical practice in a meaningful way.
19
References
[1] V. S. Baselski, M. el Torky, J. J. Coalson, and J. P. Griffin. The standardization
of criteria for processing and interpreting laboratory specimens in patients with
suspected ventilator-associated pneumonia. Chest, 102:571S, 1992. doi: 10.
1378/chest.102.5 supplement 1.571s.
[2] Akshay Dafal, Sunil Kumar, Sachin Agrawal, Sourya Acharya, and Apoorva
Nirmal. Admission anion gap metabolic acidosis and its impact on patients in
medical intensive care unit. Journal of Laboratory Physicians, 13(02):107–111,
2021.
[4] Atul A Kalanuria, William Zai, and Marek Mirski. Ventilator-associated pneu-
monia in the icu. Critical Care, 18:208, 2014. doi: 10.1186/cc13775.
[5] Andre C. Kalil, Mark L. Metersky, Michael Klompas, and et al. Management
of adults with hospital-acquired and ventilator-associated pneumonia: 2016
clinical practice guidelines by the infectious diseases society of america and
the american thoracic society. Clinical Infectious Diseases, 63:e61, 2016. doi:
10.1093/cid/ciw353.
[8] Vahid Moosavi. Unlocking pre-trained models for natural language pro-
cessing: A practical guide. Scientific Data, 9:1–12, 2022. doi: 10.1038/
s41597-022-01899-x.
20
[10] Carmen Andrea Pfortmueller, Dominik Uehlinger, Stephan von Haehling, and
Joerg Christian Schefold. Serum chloride levels in critical illness—the hidden
story. Intensive care medicine experimental, 6:1–14, 2018.
[12] Norman Wimberley, Leonard J Faling, and John G Bartlett. A fiberoptic bron-
choscopy technique to obtain uncontaminated lower airway secretions for bac-
terial culture. American Review of Respiratory Disease, 119:337, 1979. doi:
10.1164/arrd.1979.119.3.337.
[13] Lu Zhang, Shuang Li, Shuai Yuan, Xiaojie Lu, Jie Li, Yang Liu, Ting Huang,
Jie Lyu, and Hongjun Yin. The association between bronchoscopy and the
prognoses of patients with ventilator-associated pneumonia in intensive care
units: A retrospective study based on the mimic-iv database. Frontiers in
pharmacology, 13:868920, 2022. doi: 10.3389/fphar.2022.868920.
21
Appendix 1
Team Contributions
Digital Phenotyping
Key terms related to digital phenotyping
Ventilator-Associated Pneumonia (VAP) Definition for Phenotyping:
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 1/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Step 1: First we find the patients with ventilation more than 48 hours
Step 1a: To Find Patients with ICU Stays > 48 Hours
Step Description: In this step, we focus on identifying patients who have spent more than 48 hours in the intensive care unit (ICU).
Methodology:
1. We begin by considering patients in the ICU setting.
2. The primary objective is to distinguish patients who have had ICU stay of more than 48 hours.
A patient undergoes ventilation inbetween the ICU stays. Hence, to get the patients with more than 48 hrs of ventilation, we first find
patients with ICU time greater than 48 hours.
In [4]: # Query to find patients with ICU stays greater than 48 hours
WITH VentilationSessions AS (
SELECT
pe.subject_id, -- Patient identifier
pe.starttime AS vent_starttime, -- Start time of the procedure
pe.endtime AS vent_endtime, -- End time of the procedure
vid.label AS ventilation_type, -- Label description for the ventilation item
CASE
WHEN pe.value >= 2880 THEN pe.value -- 48hours has 2880 mins
ELSE NULL
END AS vent_duration, -- Duration of the procedure (null for values < 2880)
pe.valueuom AS duration_unit, -- Unit of measurement for duration
pe.patientweight, -- Patient's weight at the time of the procedure
ROW_NUMBER() OVER (PARTITION BY pe.subject_id ORDER BY pe.starttime) AS vent_seq_no,
COUNT(*) OVER (PARTITION BY pe.subject_id) AS total_vent_stays
FROM `physionet-data.mimiciv_icu.procedureevents` AS pe
JOIN (
SELECT itemid, label
FROM `physionet-data.mimiciv_icu.d_items`
WHERE LOWER(label) LIKE '%ventilation%'
) AS vid ON vid.itemid = pe.itemid
WHERE pe.statusdescription LIKE 'FinishedRunning'
ORDER BY pe.subject_id, pe.starttime)
SELECT subject_id,
vent_starttime,
vent_endtime,
ventilation_type,
vent_duration,
duration_unit,
patientweight,
vent_seq_no,
total_vent_stays
FROM VentilationSessions
WHERE vent_duration IS NOT NULL
""")
patients_with_vent_session.head()
Out[5]: subject_id vent_starttime vent_endtime ventilation_type vent_duration duration_unit patientweight vent_seq_no total_vent_stays
0 10001884 2131-01-13 2131-01-19 Invasive 9465.0 min 65.0 3 4
04:00:00 17:45:00 Ventilation
1 10001884 2131-01-15 2131-01-19 Invasive 6576.0 min 65.0 4 4
04:07:00 17:43:00 Ventilation
2 10002428 2156-04-19
20:10:00
2156-04-22
17:05:00
Invasive
Ventilation 4135.0 min 43.0 1 3
3 10002428 2156-05-11 2156-05-20 Invasive 12640.0 min 48.4 3 3
16:05:00 10:45:00 Ventilation
4 10003400 2137-02-25
23:37:00
2137-02-28
14:17:00
Invasive
Ventilation 3760.0 min 93.0 1 4
Step Description: In this step, we focus on filtering the patients who have had ICU stays greater than 48hrs AND ventilation sessions
greater than 48 hours
Methodology:
1. We combined the two dataframes obtained in step 1a and step 1b.
2. Intuitively, the ventilation stay should start between the ICU stays. But, there are some patients whose ventilation started some minutes
before their ICU intime. Hence, i have added a delta of 15 mins
Result:
1. The resulting table combines information about patients' ICU stays and ventilation sessions. For each ICU stay lasting more than 48
hours, the table will include rows corresponding to all ventilation sessions that lasted more than 48 hours during that specific ICU stay.
In [6]: #Perform inner join
result_df = pd.merge(icu_admission_table, patients_with_vent_session, on='subject_id', how='inner')
In [7]: # filter
ventilation_and_icu_data = result_df[
(result_df['vent_starttime'] >= result_df['icu_intime'] - pd.Timedelta(minutes=15)) & (result_df['vent_endtime'] <
End of Step 1
In this step, we have identified a total of 8034 patients who met the following criteria:
They had ICU stays lasting more than 48 hours.
During these extended ICU stays, these patients also had ventilation sessions that lasted more than 48 hours.
WITH TempInfo AS (
SELECT itemid, label
FROM `physionet-data.mimiciv_icu.d_items`
WHERE LOWER(label) LIKE '%temperature celsius%'
OR LOWER(label) LIKE '%temperature fahrenheit%'
OR LOWER(label) LIKE '%temperaturef_apacheiv%'
)
SELECT
lb.subject_id, -- Patient identifier
lb.charttime, -- Vital sign chart time
lb.valuenum , -- Vital sign value number
lb.valueuom -- UoM
FROM
`physionet-data.mimiciv_icu.chartevents` AS lb,
TempInfo as ti
"""
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 4/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Execute the query and retrieve patient demographic information.
patients_with_fever = run_query(query)
Filter the tests to include only those performed after 48 hours of the ventilation session and before the respective ICU end time
associated with that ventilation session.
In [11]: result_df = pd.merge(ventilation_and_icu_data, patients_with_fever, on='subject_id', how='inner')
<ipython-input-13-a4ee555a8c4e>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
In [14]: patients_with_fever_after_48hr_ventilation.head()
Out[14]: subject_id icu_seq total_icu_stays icu_intime icu_outtime icu_los vent_seq_no total_vent_stays vent_starttime vent_endtime ven
4 10002428 2 4 192156-04-
18:11:19
2156-04-26 7.032894
18:58:41 1 3 2156-04-19
20:10:00
2156-04-22
17:05:00
2156-05- 2156-05-22 10.977222 2156-05-11 2156-05-20
19 10002428 4 4 11 14:16:46 3 3 16:05:00 10:45:00
14:49:34
2196-02- 2196-02-29 4.952106 2196-02-24 2196-02-27
20 10004235 1 1 24 15:58:02 1 1 16:52:00 16:28:00
17:07:00
2144-04- 2144-05-01 9.711146 2144-04-21 2144-05-01
64 10004401 5 7 21 13:53:03 6 8 21:23:00 13:53:00
20:49:00
2144-04- 2144-05-01 9.711146 2144-04-21 2144-05-01
68 10004401 5 7 21 13:53:03 6 8 21:23:00 13:53:00
20:49:00
4354
Out[16]:
WITH WBCInfo AS (
SELECT itemid
FROM `physionet-data.mimiciv_hosp.d_labitems`
WHERE LOWER(label) LIKE '%white blood cells%'
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 5/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
)
SELECT
lb.subject_id, -- Patient identifier
lb.charttime AS WBC_charttime, -- Vital sign chart time
lb.valuenum AS WBC_count , -- Vital sign value number
lb.valueuom AS WBC_scale -- UoM
FROM
`physionet-data.mimiciv_hosp.labevents` AS lb,
WBCInfo as wi
Filter the tests to include only those performed after 48 hours of the ventilation session and before the respective ICU end time
associated with that ventilation session.
In [18]: result_df = pd.merge(ventilation_and_icu_data, patients_with_leukocytosis, on='subject_id', how='inner')
3474
Out[20]:
End of Step 2
In this step, we have identified a total of 3474 patients who meet a combination of two key criteria:
1. ICU Stays and Ventilation Sessions: These patients had ICU stays lasting more than 48 hours, and during these extended ICU stays,
they also had ventilation sessions that persisted for more than 48 hours.
AND
1. Clinical Indicator Criteria: We further refined our identification to focus on patients who were diagnosed with specific clinical
indicators, including fever, leukocytosis, and leukopenia. These diagnostic criteria were assessed after the patients' 48-hour ventilation
sessions and before the respective ICU end times for those ventilation sessions.
SELECT
subject_id,
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 6/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
spec_type_desc as path_spec_type_desc,
charttime AS path_charttime,
CASE
WHEN LOWER(spec_type_desc) LIKE '%tracheal aspirate%' THEN 'Endotracheal aspirates – ≥1,000,000 colony forming
WHEN LOWER(spec_type_desc) LIKE '%bronchial brush%' THEN 'Bronchoscopic- or mini-BAL – 10,000 cfu/mL'
WHEN LOWER(spec_type_desc) LIKE '%mini-bal%' THEN 'Bronchoscopic- or mini-BAL – 10,000 cfu/mL'
ELSE NULL
END AS path_comment_data
FROM `physionet-data.mimiciv_hosp.microbiologyevents`
WHERE (
-- Filter spec_type_desc
(
LOWER(spec_type_desc) LIKE '%tracheal aspirate%'
OR LOWER(spec_type_desc) LIKE '%bronchial brush%'
OR LOWER(spec_type_desc) LIKE '%mini-bal%'
)
-- Filter negative comments
AND comments IS NOT NULL
AND LOWER(comments) NOT LIKE '%absent%'
AND comments NOT LIKE '___'
AND comments NOT LIKE '%NO POLYMORPHONUCLEAR LEUKOCYTES SEEN%'
AND comments NOT LIKE '%NO MYCOBACTERIA%'
AND comments NOT LIKE '%NO MICROORGANISMS SEEN.%'
AND (
comments NOT LIKE '%NO FUNGUS ISOLATED%'
AND comments NOT LIKE '%NO ACID FAST BACILLI SEEN ON CONCENTRATED SMEAR%'
AND comments NOT LIKE '%NO GROWTH%'
AND comments NOT LIKE '%NO VIRUS ISOLATED%'
AND comments NOT LIKE '%NO LEGIONELLA ISOLATED%'
AND comments NOT LIKE '%NO ANAEROBES ISOLATED%'
AND comments NOT LIKE '%NO NOCARDIA ISOLATED%'
AND comments NOT LIKE '%No Cytomegalovirus (CMV) isolated%'
AND comments NOT LIKE '%No Herpes simplex (HSV) virus isolated%'
AND comments NOT LIKE '%NO FUNGAL ELEMENTS SEEN%'
)
-- Additional filters for invalid events
AND LOWER(comments) NOT LIKE '%test cancelled%'
AND LOWER(comments) NOT LIKE '%invalid%'
AND LOWER(comments) NOT LIKE '%not detected%'
AND LOWER(comments) NOT LIKE '%unknown amount%'
AND LOWER(comments) NOT LIKE '%no respiratory viruses isolated%'
AND LOWER(comments) NOT LIKE '%rare growth%'
AND LOWER(comments) NOT LIKE '%sparse growth%'
AND LOWER(comments) NOT LIKE '%negative%'
)
and charttime is not null;
"""
Filter the tests to include only those performed after 48 hours of the ventilation session and before the respective ICU end time
associated with that ventilation session.
In [22]: result_df = pd.merge(ventilation_and_icu_data, pulmonary_pathogens_info, on='subject_id', how='inner')
299
Out[24]:
End of Step 3
In this step, we have identified a total of 299 patients who meet a combination of two key criteria:
1. ICU Stays and Ventilation Sessions: These patients had ICU stays lasting more than 48 hours, and during these extended ICU stays,
they also had ventilation sessions that persisted for more than 48 hours.
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 7/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
AND
1. Microbiology Events (bacterial growth): We further refined our identification to focus on patients who surpassed the threshold value
of the pulmonary pathogens. These diagnostic criteria were assessed after the patients' 48-hour ventilation sessions and before the
respective ICU end times for those ventilation sessions.
Final Phenotype Identification
In the final phenotype identification, we are looking for patients who meet one of the following sets of criteria:
1. Mechanical Ventilation Duration and Clinical Indicators: Patients who have a prolonged duration of mechanical ventilation, and they
exhibit specific clinical indicators such as Leukocytosis and Fever.
For this, we have identified 3474 patients.
OR
1. Mechanical Ventilation Duration and Microbiology Events: Patients who have a prolonged duration of mechanical ventilation and
experience significant microbiology events, specifically high values in tests like tracheal aspirates, mini-BAL, or PSB (Protected
Specimen Brush).
For this, we have identified 299 patients.
Next, we will combine the patients obtained from both the condition and that will make our Final phenotype.
In [25]: final_phenotype_patients = list(set(patients_with_abnormal_bacterial_growth_after_48hr_ventilation_distinct_subject_id
patients_with_leukocytosis_and_fever_after_48hr_ventilation_distinct_subject_ids))
In [26]: len(final_phenotype_patients)
3538
Out[26]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1091 entries, 0 to 1090
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 1091 non-null Int64
dtypes: Int64(1)
memory usage: 9.7 KB
# Calculate the total ICU duration for each patient (subject_id) by summing the 'los' column within each group
icu_admission_table_sum['total_icu_duration'] = icu_admission_table.groupby('subject_id')['los'].transform('sum')
# Select and retain only the columns 'subject_id', 'total_icu_stays', and 'total_icu_duration'
icu_admission_table_sum = icu_admission_table_sum[['subject_id', 'total_icu_stays', 'total_icu_duration']]
# Remove duplicate rows based on all columns, keeping one row per unique combination
icu_admission_table_sum = icu_admission_table_sum.drop_duplicates()
In [29]: icu_admission_table_sum.head()
# Calculate the total ICU duration for each patient (subject_id) by summing the 'los' column within each group
patients_with_vent_session_sum['total_vent_duration'] = patients_with_vent_session.groupby('subject_id')['vent_duratio
# Select and retain only the columns 'subject_id', 'total_icu_stays', and 'total_icu_duration'
patients_with_vent_session_sum = patients_with_vent_session_sum[['subject_id', 'total_vent_stays', 'total_vent_duratio
# Remove duplicate rows based on all columns, keeping one row per unique combination
patients_with_vent_session_sum = patients_with_vent_session_sum.drop_duplicates()
In [32]: patients_with_vent_session_sum.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 9/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Out[33]: subject_id vent_starttime vent_endtime ventilation_type vent_duration duration_unit patientweight vent_seq_no total_vent_stays
0 10001884 2131-01-13 2131-01-19 Invasive 9465.0 min 65.0 3 4
04:00:00 17:45:00 Ventilation
1 10001884 2131-01-15 2131-01-19 Invasive 6576.0 min 65.0 4 4
04:07:00 17:43:00 Ventilation
In [36]: # Filter the DataFrame to select rows with subject IDs in 'final_phenotype_patients'
ventilation_and_icu_data_sum = result_df[result_df['subject_id'].isin(final_phenotype_patients)]
In [37]: ventilation_and_icu_data_sum.head()
In [38]: ventilation_and_icu_data_sum.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3538 entries, 3 to 8359
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 3538 non-null Int64
1 total_icu_stays 3538 non-null Int64
2 total_icu_duration 3538 non-null float64
3 total_vent_stays 3538 non-null Int64
4 total_vent_duration 3538 non-null float64
dtypes: Int64(3), float64(2)
memory usage: 176.2 KB
In [41]: patients_with_fever_sum.head()
In [44]: patients_with_leukocytosis_sum.head()
The avg white blood cell (WBC) indicates the average WBC of the patient after 48 hours of the ventilation
session and before the respective ICU end time associated with that ventilation session.
In [45]: result_df = pd.merge(patients_with_leukocytosis_sum, patients_with_fever_sum, on='subject_id', how='inner')
In [47]: step1_2_patient_info.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 11/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
In [51]: step1_2_3_patient_info.head()
In [53]: final_digital_phenotype.head()
3538
Out[54]:
True
Out[55]:
Feature Selection
Risk Assessment Factors
A. Patient Characteristics
1. Average BMI (Body Mass Index): Average BMI before the ventilation of 48 hrs started.
1a. First we find all the height and weight measurement of the patient before the start of its 48 hrs ventilation session.
NOTE: We are considering the average of weight that was taken before each ventilation stay
1b. Then we find the average BMI
2. Gender
3. Age
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 12/51
B. Outcome Variable
28/10/2023, 20:56 COMP90089_JupyterNotebook
1. Date of Death: This is the variable we aim to predict or assess, representing whether the patient passed away or not.
C. Medical Data
1. Lab Test Results - Average of the lab results taken between the ICU intime and before the first ventillation session for that ICD
sequence.
2. Vital Test Results - Average of the vital results taken between the ICU intime and before the first ventillation session for that ICD
sequence.
D. Patient History
1. Co-morbidity Index: This factor represents the patient's pre-existing medical conditions and co-morbidities when they were initially
admitted to the first ICU stay. It helps us understand the patient's overall health condition.
E. Bronschopy
1. Bronchoscopy Procedure: Check if the patient has received Bronchoscopy Procedure during its ventilation session (that lasted for
more than 48 hours).
A. Patient Characteristics
BMI Value
1. First, we will find the height of the patients
2. Next, we will use the weight we got in our ventilation_and_icu_data table
In [56]: query = f"""
WITH ht_in AS (
SELECT
c.subject_id, c.stay_id, c.charttime
-- Ensure that all heights are in centimeters
, ROUND(CAST(c.valuenum * 2.54 AS NUMERIC), 2) AS height
, c.valuenum AS height_orig
FROM `physionet-data.mimiciv_icu.chartevents` c
WHERE c.valuenum IS NOT NULL
-- Height (measured in inches)
AND c.itemid = 226707
)
, ht_cm AS (
SELECT
c.subject_id, c.stay_id, c.charttime
-- Ensure that all heights are in centimeters
, ROUND(CAST(c.valuenum AS NUMERIC), 2) AS height
FROM `physionet-data.mimiciv_icu.chartevents` c
WHERE c.valuenum IS NOT NULL
-- Height cm
AND c.itemid = 226730
)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 13/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Display the first few rows of the result.
patient_height_info.head()
In [61]: # Calculate the average height and weight for each patient
patient_avg_bmi_before_vent_sum['avg_ht_cm'] = patient_avg_bmi_before_vent.groupby('subject_id')['height_cm'].transfor
patient_avg_bmi_before_vent_sum['avg_wt_kg'] = patient_avg_bmi_before_vent.groupby('subject_id')['patientweight'].tran
patient_avg_bmi_before_vent_sum['avg_ht_cm'] = patient_avg_bmi_before_vent_sum['avg_ht_cm'].round(3)
patient_avg_bmi_before_vent_sum = patient_avg_bmi_before_vent_sum.drop_duplicates()
patient_avg_bmi_before_vent_sum.reset_index(inplace = True, drop = True)
In [62]: patient_avg_bmi_before_vent_sum.head()
SELECT
subject_id,
gender,
anchor_age AS age,
CASE
WHEN dod IS NOT NULL THEN 1
ELSE 0
END AS death_flag
FROM `physionet-data.mimic_core.patients`
WHERE anchor_age <> 0
AND subject_id IN ({', '.join(map(str, final_phenotype_patients))}) -- Filter for specified subject IDs
"""
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 14/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Medical Data
Lab Test Results
Average of the lab results taken between the ICU intime and before the first ventillation session for that ICD sequence.
In [64]: ###DO NOT RUN. I have already taken the CSV for this.
query = f"""
WITH lab_events_Hematology AS (
SELECT itemid, label, fluid, category
FROM `physionet-data.mimiciv_hosp.d_labitems`
WHERE
LOWER(label) LIKE '%white blood cell%'
OR LOWER(label) LIKE 'neutrophils'
OR (LOWER(label) LIKE 'lymphocytes' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'basophils' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'eosinophils' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'monocytes' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR LOWER(label) LIKE '%red blood cells%'
OR (LOWER(label) LIKE 'hematocrit' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'hemoglobin' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'mcv' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'mch' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'mchc' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE 'rdw' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
OR (LOWER(label) LIKE '%platelets%' AND fluid LIKE 'Blood' AND category LIKE 'Hematology')
-- Electrolytes and Blood Gases
OR LOWER(label) LIKE 'anion gap'
OR LOWER(label) LIKE 'bicarbonate'
OR LOWER(label) LIKE 'total calcium'
OR (LOWER(label) LIKE 'free calcium' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
OR LOWER(label) LIKE 'chloride'
OR LOWER(label) LIKE 'sodium'
OR LOWER(label) LIKE 'potassium'
OR (LOWER(label) LIKE 'base excess' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
OR (LOWER(label) LIKE 'ph' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
OR (LOWER(label) LIKE 'pco2' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
OR (LOWER(label) LIKE 'po2' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
-- Metabolic Markers
OR (LOWER(label) LIKE 'lactate' AND fluid LIKE 'Blood' AND category LIKE 'Blood Gas')
OR LOWER(label) LIKE 'creatinine'
OR LOWER(label) LIKE 'urea nitrogen'
OR (LOWER(label) LIKE 'glucose' AND fluid LIKE 'Urine' AND category LIKE 'Hematology')
OR LOWER(label) LIKE 'sct - normalized ratio'
-- Coagulation
OR LOWER(label) LIKE 'pt'
OR LOWER(label) LIKE 'ptt'
-- Liver Function
OR LOWER(label) LIKE '%alanine aminotransferase (alt)'
OR LOWER(label) LIKE 'alkaline phosphatase'
OR LOWER(label) LIKE '%lactate dehydrogenase (ld)'
OR LOWER(label) LIKE 'bilirubin, total'
OR LOWER(label) LIKE 'albumin'
SELECT subject_id,
charttime,
label,
valuenum,
valueuom,
flag
FROM
`physionet-data.mimiciv_hosp.labevents` AS lb,
lab_events_Hematology AS leh
WHERE lb.itemid = leh.itemid
AND lb.subject_id IN ({', '.join(map(str, final_phenotype_patients))}) -- Filter for specified subject IDs
ORDER BY label, valueuom;
"""
lab_results_all = run_query(query)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 15/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
lab_results_all.head()
In [ ]: # csv_filename = 'lab_results_all.csv'
# lab_results_all.to_csv(csv_filename, index=True)
In [ ]: lab_results_all = pd.read_csv('/content/lab_results_all.csv')
In [66]: #Average of the lab results taken between the ICU intime and before the first ventillation session for that ICD sequen
patients_lab_results_before_ventilation_session = result_df[
(result_df['charttime'] >= result_df['icu_intime'] )&
(result_df['charttime'] <= result_df['vent_starttime'])
]
patients_lab_results_before_ventilation_session.rename(columns=column_mapping, inplace=True)
#Taking the first ventilation session information for the ICU sequence
patients_lab_results_before_ventilation_session.sort_values(by=['subject_id', 'icu_seq', 'vent_seq_no', 'lab_test' ],
patients_lab_results_before_ventilation_session = patients_lab_results_before_ventilation_session.groupby(['subject_id
In [69]: patients_lab_results_before_ventilation_session.head()
36
Out[70]:
query = f"""
WITH lab_events_Hematology AS (
SELECT itemid, label, fluid, category
FROM `physionet-data.mimiciv_hosp.d_labitems`
WHERE
label IN ({', '.join([f"'{test}'" for test in distinct_lab_tests])})
)
SELECT distinct label, ref_range_lower, ref_range_upper
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 16/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
FROM
`physionet-data.mimiciv_hosp.labevents` AS lb,
lab_events_Hematology AS leh
WHERE lb.itemid = leh.itemid
AND ref_range_upper <> 0
AND (ref_range_upper IS NOT NULL)
AND (ref_range_lower IS NOT NULL)
"""
reference_range_info = run_query(query)
reference_range_info.head()
We take the average of the lower and upper range, if we have multiple of these rows.
In [72]: reference_range_info.sort_values(by=['label'], inplace=True)
reference_range_info = reference_range_info.groupby(['label']).mean().reset_index()
reference_range_info = reference_range_info.round(3)
In [73]: reference_range_info.head()
Now that we have our range, we append patients_lab_results_before_ventilation_session with the flag value
In [74]: # Merge the DataFrames based on the 'lab_test' column
patients_lab_results_before_ventilation_session = patients_lab_results_before_ventilation_session.merge(reference_rang
patients_lab_results_before_ventilation_session = patients_lab_results_before_ventilation_session.drop(
patients_lab_results_before_ventilation_session[patients_lab_results_before_ventilation_session['ref_range_lower']
# Drop the extra 'label' column
patients_lab_results_before_ventilation_session = patients_lab_results_before_ventilation_session.drop(['label','ref_r
34
Out[78]:
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 17/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
WITH vital_signs_selected AS (
SELECT itemid, label, category, unitname
FROM `physionet-data.mimiciv_icu.d_items`
WHERE LOWER(label) LIKE '%heart rate%'
OR LOWER(label) LIKE '%arterial pressure%'
OR LOWER(label) LIKE '%respiratory rate (total)%'
OR LOWER(label) LIKE '%temperature celsius%'
OR LOWER(label) LIKE '%temperature f%'
OR LOWER(label) LIKE '%spo2%'
OR LOWER(label) LIKE '%urine output%'
)
FROM `physionet-data.mimiciv_icu.chartevents` AS lb
JOIN vital_signs_selected ON vital_signs_selected.itemid = lb.itemid
AND lb.subject_id IN ({', '.join(map(str, final_phenotype_patients))}) -- Filter for specified subject IDs
"""
vitals_results_all = run_query(query)
vitals_results_all.head()
In [ ]: csv_filename = 'vitals_results_all.csv'
vitals_results_all.to_csv(csv_filename, index=True)
In [ ]: vitals_results_all = pd.read_csv('/content/vitals_results_all.csv')
In [82]: #Average of the lab results taken between the ICU intime and before the first ventillation session for that ICD sequen
patients_vital_results_before_ventilation_session = result_df[
(result_df['Vital_Test_charttime'] >= result_df['icu_intime'] )&
(result_df['Vital_Test_charttime'] <= result_df['vent_starttime'])
]
In [84]: patients_vital_results_before_ventilation_session[patients_vital_results_before_ventilation_session['subject_id'] == 1
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 18/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
In [88]: patients_vital_results_before_ventilation_session.head(5)
Bronschopy Procedure
In [89]: query = f"""
-- Define bronchoscopy items
WITH bronchoscopy_info AS (
SELECT * FROM `physionet-data.mimiciv_icu.d_items`
WHERE LOWER(label) LIKE 'bronchoscopy'
)
SELECT
pe.subject_id, -- Patient identifier
pe.starttime, -- Bronchoscopy start time
pe.endtime, -- Bronchoscopy end time
bi.label, -- Bronchoscopy label
FROM
`physionet-data.mimiciv_icu.procedureevents` AS pe
JOIN bronchoscopy_info AS bi ON bi.itemid = pe.itemid
AND pe.statusdescription LIKE 'FinishedRunning'
AND subject_id IN ({', '.join(map(str, final_patient_cohort_subject_ids))}) -- Filter for specified subject IDs
"""
bronchoscopy_patient_info = run_query(query)
bronchoscopy_patient_info.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 19/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
In [91]: result_df.head()
Out[91]: subject_id icu_seq total_icu_stays icu_intime icu_outtime los vent_seq_no total_vent_stays vent_starttime vent_endtime vent
2144-01- 2144-02-06 10.636238
0 10004401 1 7 26 13:44:15 1 8 2144-01-30
13:00:00
2144-02-03
08:01:00
22:28:04
2144-01- 2144-02-06 10.636238 2144-01-30 2144-02-03
1 10004401 1 7 26 13:44:15 1 8 13:00:00 08:01:00
22:28:04
2144-01- 2144-02-06 10.636238 2144-01-30 2144-02-03
2 10004401 1 7 26 13:44:15 1 8 13:00:00 08:01:00
22:28:04
2144-01- 2144-02-06 10.636238 2144-01-30 2144-02-03
3 10004401 1 7 26 13:44:15 1 8 13:00:00 08:01:00
22:28:04
2144-02- 2144-02-19 6.843345 2144-02-12 2144-02-16
4 10004401 2 7 12 14:42:24 2 8 19:00:00 10:35:00
18:27:59
1485
Out[93]:
In [94]: data = {
'subject_id': distinct_subject_ids_1,
'bronchoscopy': 1
}
patients_with_bronchoscopy = pd.DataFrame(data)
In [95]: patients_with_bronchoscopy.head()
Charlson-Comorbidity Index
In [96]: patient_comorbidity_index = """
WITH diag AS (
SELECT
hadm_id
, CASE WHEN icd_version = 9 THEN icd_code ELSE NULL END AS icd9_code
, CASE WHEN icd_version = 10 THEN icd_code ELSE NULL END AS icd10_code
FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
)
, com AS (
SELECT
ad.hadm_id,
-- Malignant Neoplasm
MAX(CASE
WHEN (icd10_code LIKE 'V109%') THEN 2
ELSE 0 END) AS malignancy,
-- Acute Myocardial Infarction
MAX(CASE
WHEN (icd9_code LIKE '410%' OR icd10_code LIKE 'I21%') THEN 1
ELSE 0 END) AS mi,
-- Congestive Heart Failure
MAX(CASE
WHEN (icd9_code LIKE '39891%' OR icd10_code LIKE 'I50%') THEN 1
ELSE 0 END) AS chf,
-- Cerebrovascular Disease
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 20/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
MAX(CASE
WHEN (icd9_code LIKE '43401%' OR icd9_code LIKE '43411%'
OR icd9_code LIKE '43491%' OR icd10_code LIKE 'I63%') THEN 1
ELSE 0 END) AS cerebrovascular_disease,
-- Chronic Pulmonary Disease
MAX(CASE
WHEN (icd9_code LIKE '496%' OR icd10_code LIKE 'I279%'
OR icd10_code LIKE 'J44%') THEN 1
ELSE 0 END) AS chronic_pulmonary_disease,
-- Liver Disease
MAX(CASE
WHEN (icd9_code LIKE '571%'
OR (icd10_code LIKE 'K7%' AND icd10_code >= 'K70' AND icd10_code <= 'K77')) THEN 1
ELSE 0 END) AS liver_disease,
-- Kidney Disease
MAX(CASE
WHEN (icd10_code LIKE 'N18%' OR icd9_code LIKE '585%') THEN 2
ELSE 0 END) AS kidney_disease,
-- Diabetes
MAX(CASE
WHEN (icd9_code LIKE '250%'
OR (icd10_code LIKE 'E10%' AND icd10_code >= 'E10' AND icd10_code <= 'E11')) THEN 1
ELSE 0 END) AS diabetes
FROM `physionet-data.mimiciv_hosp.admissions` ad
LEFT JOIN diag
ON ad.hadm_id = diag.hadm_id
GROUP BY ad.hadm_id
)
, ag AS (
SELECT
ad.hadm_id,
ad.subject_id,
anchor_age + (EXTRACT(YEAR FROM ad.admittime) - anchor_year) AS age,
CASE
WHEN (anchor_age + (EXTRACT(YEAR FROM ad.admittime) - anchor_year)) <= 50 THEN 0
WHEN (anchor_age + (EXTRACT(YEAR FROM ad.admittime) - anchor_year)) <= 60 THEN 1
WHEN (anchor_age + (EXTRACT(YEAR FROM ad.admittime) - anchor_year)) <= 70 THEN 2
WHEN (anchor_age + (EXTRACT(YEAR FROM ad.admittime) - anchor_year)) <= 80 THEN 3
ELSE 4
END AS age_score
FROM
`physionet-data.mimiciv_hosp.admissions` ad
JOIN
`physionet-data.mimiciv_hosp.patients` p
ON
ad.subject_id = p.subject_id
)
SELECT
ad.subject_id,
malignancy + mi + chf + cerebrovascular_disease + chronic_pulmonary_disease
+ liver_disease + kidney_disease + diabetes
AS charlson_comorbidity_index
FROM `physionet-data.mimiciv_hosp.admissions` ad
LEFT JOIN com
ON ad.hadm_id = com.hadm_id
LEFT JOIN ag
ON com.hadm_id = ag.hadm_id
;
"""
patient_comorbidity_index_info = run_query(patient_comorbidity_index)
patient_comorbidity_index_info.head()
62989
Out[98]:
In [99]: final_digital_phenotype.info()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 21/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3540 entries, 0 to 3539
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 3540 non-null Int64
1 total_icu_stays 3540 non-null Int64
2 total_icu_duration 3540 non-null float64
3 total_vent_stays 3540 non-null Int64
4 total_vent_duration 3540 non-null float64
5 avg_WBC_count 3476 non-null float64
6 avg_body_temp 3476 non-null float64
7 path_spec_type_desc 301 non-null object
8 path_comment_data 301 non-null object
dtypes: Int64(3), float64(4), object(2)
memory usage: 286.9+ KB
In [101… result_df.head()
FINAL PHENOTYPE
In [104… final_digital_phenotype.head()
In [105… patient_avg_bmi_before_vent_sum.head()
In [106… patient_general_info.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 22/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
In [107… patients_lab_results_before_ventilation_session.head()
In [108… patients_vital_results_before_ventilation_session.head()
In [109… patients_with_bronchoscopy.head()
In [110… patient_comorbidity_index_info.head()
In [113… column_mapping = {
'flag': 'lab_test_flag'
}
result_df_6.rename(columns=column_mapping, inplace=True)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 23/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Data Pre-processing
Drop Null values
In [120… #dropping columns that got created while dataframe was saved as a csv file
identified_final_cohort.drop(['Unnamed: 0'], axis = 1, inplace = True)
In [122… result_df.head()
Out[122]: subject_id total_icu_stays total_icu_duration total_vent_stays total_vent_duration avg_WBC_count avg_body_temp avg_bmi gender
0 10004235 1 4.952 1 2.983 15.100 38.500 29.860551 M
1 10004401 7 53.950 8 43.558 9.288 38.222 31.149509 M
In [125… result_df.head()
Out[125]: subject_id total_icu_stays total_icu_duration total_vent_stays total_vent_duration avg_WBC_count avg_body_temp avg_bmi gender
0 10004401 7 53.95 8 43.558 9.288 38.222 31.149509 M
1 10004401 7 53.95 8 43.558 9.288 38.222 31.149509 M
2 10004401 7 53.95 8 43.558 9.288 38.222 31.149509 M
3 10004401 7 53.95 8 43.558 9.288 38.222 31.149509 M
4 10004401 7 53.95 8 43.558 9.288 38.222 31.149509 M
LAB TEST
In [126… lab_info_df = result_df.copy()
# If you want to replace NaN with None, you can use the following line:
lab_info_df = lab_info_df.where(lab_info_df.notna(), None)
In [131… lab_info_df.info()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 24/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290 entries, 0 to 2289
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 2290 non-null Int64
1 Alanine Aminotransferase (ALT) 1393 non-null float64
2 Albumin 1099 non-null float64
3 Alkaline Phosphatase 1386 non-null float64
4 Anion Gap 1961 non-null float64
5 Basophils 1211 non-null float64
6 Bicarbonate 1963 non-null float64
7 Bilirubin, Total 1383 non-null float64
8 Chloride 1966 non-null float64
9 Creatinine 1965 non-null float64
10 Eosinophils 1211 non-null float64
11 Free Calcium 1635 non-null float64
12 Glucose 164 non-null float64
13 Hematocrit 2041 non-null float64
14 Hemoglobin 2034 non-null float64
15 Lactate 1953 non-null float64
16 Lactate Dehydrogenase (LD) 1077 non-null float64
17 Lymphocytes 1211 non-null float64
18 MCH 2033 non-null float64
19 MCHC 2034 non-null float64
20 MCV 2033 non-null float64
21 Monocytes 1211 non-null float64
22 Neutrophils 1211 non-null float64
23 PT 1938 non-null float64
24 PTT 1928 non-null float64
25 Potassium 1963 non-null float64
26 RDW 2032 non-null float64
27 Red Blood Cells 2033 non-null float64
28 SCT - Normalized Ratio 1 non-null float64
29 Sodium 1962 non-null float64
30 Urea Nitrogen 1970 non-null float64
31 White Blood Cells 2033 non-null float64
32 pCO2 2072 non-null float64
33 pH 2080 non-null float64
34 pO2 2073 non-null float64
dtypes: Int64(1), float64(34)
memory usage: 628.5 KB
VITAL INFO
In [132… vital_info_df = result_df.copy()
# If you want to replace NaN with None, you can use the following line:
vital_info_df = vital_info_df.where(vital_info_df.notna(), None)
In [136… vital_info_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290 entries, 0 to 2289
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 2290 non-null Int64
1 Heart Rate 2224 non-null float64
2 Respiratory Rate (Total) 1947 non-null float64
3 Temperature Celsius 303 non-null float64
4 Temperature Fahrenheit 1981 non-null float64
5 Urine output_ApacheIV 1 non-null float64
dtypes: Int64(1), float64(5)
memory usage: 109.7 KB
Comobidity Index
In [137… result_df.drop(['lab_test', 'lab_test_value', 'Vital_Test', 'Vital_Test_Value'], axis=1, inplace = True)
In [140… result_df.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 25/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Out[140]: subject_id total_icu_stays total_icu_duration total_vent_stays total_vent_duration avg_WBC_count avg_body_temp avg_bmi gender
0 10004401 7 53.950 8 43.558 9.288 38.222 31.149509 M
1 10005606 1 6.595 1 3.778 16.186 38.250 26.543366 M
2 10005817 2 18.332 2 13.226 13.257 38.685 29.330163 M
3 10005817 2 18.332 2 13.226 13.257 38.685 29.330163 M
4 10007818 1 20.529 1 20.174 18.686 38.056 25.186267 M
In [142… cmob_result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290 entries, 0 to 2289
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 subject_id 2290 non-null Int64
1 charlson_comorbidity_index 1628 non-null Float64
dtypes: Float64(1), Int64(1)
memory usage: 40.4 KB
In [146… preprocessed_cohort.head()
Out[146]:
subject_id total_icu_stays total_icu_duration total_vent_stays total_vent_duration avg_WBC_count avg_body_temp avg_bmi gender
0 10004401 7 53.950 8 43.558 9.288 38.222 31.149509 M
1 10005606 1 6.595 1 3.778 16.186 38.250 26.543366 M
2 10005817 2 18.332 2 13.226 13.257 38.685 29.330163 M
3 10007818 1 20.529 1 20.174 18.686 38.056 25.186267 M
4 10011365 1 8.703 2 7.390 12.750 38.683 18.783723 F
5 rows × 52 columns
In [ ]: csv_filename = 'preprocessed_cohort.csv'
preprocessed_cohort.to_csv(csv_filename, index=True)
In [ ]: preprocessed_cohort = pd.read_csv('preprocessed_cohort.csv')
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 26/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
print("\nColumns with More Than 40% Null Values:")
print(columns_with_high_null)
Null Counts:
total_icu_stays 0
total_icu_duration 0
total_vent_stays 0
total_vent_duration 0
avg_WBC_count 37
avg_body_temp 37
avg_bmi 160
gender 0
age 2
death_flag 0
bronchoscopy 0
charlson_comorbidity_index 0
Heart Rate 66
Respiratory Rate (Total) 343
Temperature Celsius 283
Urine output_ApacheIV 2289
Alanine Aminotransferase (ALT) 897
Albumin 1191
Alkaline Phosphatase 904
Anion Gap 329
Basophils 1079
Bicarbonate 327
Bilirubin, Total 907
Chloride 324
Creatinine 325
Eosinophils 1079
Free Calcium 655
Glucose 2126
Hematocrit 249
Hemoglobin 256
Lactate 337
Lactate Dehydrogenase (LD) 1213
Lymphocytes 1079
MCH 257
MCHC 256
MCV 257
Monocytes 1079
Neutrophils 1079
PT 352
PTT 362
Potassium 327
RDW 258
Red Blood Cells 257
SCT - Normalized Ratio 2289
Sodium 328
Urea Nitrogen 320
White Blood Cells 257
pCO2 218
pH 210
pO2 217
dtype: int64
preprocessed_cohort.drop(columns=columns_to_drop, inplace=True)
Check the total number of lab tests and vital information considered
In [ ]: preprocessed_cohort = preprocessed_cohort.rename(columns=lambda x: x.lower().replace(' ', '_'))
In [153… preprocessed_cohort.head()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 27/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Out[153]:
total_icu_stays total_icu_duration total_vent_stays total_vent_duration avg_WBC_count avg_body_temp avg_bmi gender age death
0 7 53.950 8 43.558 9.288 38.222 31.149509 M 82
1 1 6.595 1 3.778 16.186 38.250 26.543366 M 38
2 2 18.332 2 13.226 13.257 38.685 29.330163 M 66
3 1 20.529 1 20.174 18.686 38.056 25.186267 M 69
4 1 8.703 2 7.390 12.750 38.683 18.783723 F 73
5 rows × 40 columns
In [ ]: csv_filename = 'preprocessed_cohort_2290_mean_null.csv'
machine_learning_data.to_csv(csv_filename, index=True)
# Create a DataFrame for the top 10 correlated pairs with rounded correlations
corr_pairs_df = pd.DataFrame(corr_pairs, columns=["Correlation"])
corr_pairs_df.index.names = ["Feature 1", "Feature 2"] # Adjust index names
tab.auto_set_font_size(False)
tab.set_fontsize(10)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 28/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Removing Outliers
get_whisker_range: The get_whisker_range function calculates and prints the lower and upper whisker values. It is used to eliminate
the outliers.
box_plot: This function creates a box plot for selected columns in a dataset. It also calculates the whisker range for the columns by
calling the get_whisker_range function for the outlier removal process.
remove_outliers: This function filters out outliers from the dataset based on the specified bounds
comparision_plot: This function plots the comparison between different statistics for the original and outlier-removed dataset
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 29/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
import pandas as pd
import numpy as np
#Data Visualization
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns
#Clustering
# Function to calculate and print the lower and upper whisker values for the columns
def get_whisker_range(dataset, columns_to_exclude):
# Select columns to analyze, excluding those in 'columns_to_exclude'
selected_columns = [col for col in dataset.columns if col not in columns_to_exclude]
# Filter the dataset to include only data within the whisker range
dataset = dataset[(dataset[column] >= lower_bound) & (dataset[column] <= upper_bound)]
return whisker_dict
# function to visualize the box plots and return the values of the IQR range of the box plot
def box_plot(dataset, columns_to_exclude, disp):
if (disp == True):
# Create a box plot for the remaining columns
boxplot = dataset.boxplot(figsize=(10, 7))
return whisker_dict
# Function to remove outliers from a dataset based on lower and upper bounds for each column
def remove_outliers(bounds, dataset, columns_to_exclude):
# plot the comparison graph between the summary statistics of the original and outlier-removed dataset
def comparision_plot(statistics_to_compare, dataset,summary_stats,cleaned_data_summary_stats, columns_to_exclude ):
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 30/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Create an empty DataFrame to store the statistics for comparison
comparison_df = pd.DataFrame({'Column': selected_columns})
# Creating a dataframe that contains the statistics value for both the dataset (with and without outliers)
for stat in statistics_to_compare:
mean_original = summary_stats.loc[stat, selected_columns].values
mean_no_outliers = cleaned_data_summary_stats.loc[stat, selected_columns].values
comparison_df[f'{stat}_Original'] = mean_original
comparison_df[f'{stat}_No_Outliers'] = mean_no_outliers
print (f"Difference in {stat} statistic between original and cleaned dataset is\n")
diff = abs(cleaned_data_summary_stats.loc[stat] - summary_stats.loc[stat])
print(diff)
print("\n")
print (f"Percentage change in {stat} statistic between original and cleaned dataset is\n")
print((diff / summary_stats.loc[stat]) * 100)
print("\n")
#Standardize selected numeric columns in the dataset to have mean=0 and standard deviation=1
def feature_scaling(dataset, columns_to_exclude):
# initialize StandardScaler
scaler = StandardScaler()
In [ ]: standardized_data_red_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629 entries, 1 to 2288
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_vent_duration 629 non-null float64
1 avg_WBC_count 629 non-null float64
2 avg_body_temp 629 non-null float64
3 avg_bmi 629 non-null float64
4 age 629 non-null float64
5 Heart Rate 629 non-null float64
6 Respiratory Rate (Total) 629 non-null float64
7 Temperature Celsius 629 non-null float64
8 Chloride 629 non-null float64
9 Creatinine 629 non-null float64
10 Lactate 629 non-null float64
11 MCH 629 non-null float64
12 MCHC 629 non-null float64
13 PT 629 non-null float64
14 PTT 629 non-null float64
15 RDW 629 non-null float64
16 Red Blood Cells 629 non-null float64
17 Urea Nitrogen 629 non-null float64
18 White Blood Cells 629 non-null float64
19 pCO2 629 non-null float64
20 pO2 629 non-null float64
21 bronchoscopy_0 629 non-null uint8
22 bronchoscopy_1 629 non-null uint8
23 death_flag 629 non-null int64
dtypes: float64(21), int64(1), uint8(2)
memory usage: 114.3 KB
Feature Selection
Using RFE (Recursive Feature Elimination)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 31/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Combine the rankings based on the frequency of features being in the top 15
from collections import Counter
combined_rankings = Counter()
for estimator_name, ranking in feature_rankings.items():
for feature_index, rank in enumerate(ranking):
if rank <= 15:
combined_rankings[feature_index] += 1
In [ ]: len(selected_features)
15
Out[ ]:
# Combine the rankings based on the frequency of features being in the top 15
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 32/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
from collections import Counter
combined_rankings = Counter()
for estimator_name, ranking in feature_rankings.items():
for feature_index, rank in enumerate(ranking):
if rank <= 15:
combined_rankings[feature_index] += 1
# Now, 'selected_features' contains the names of the top 15 features selected by the majority of the estimators
print("Selected Features:", selected_features)
# Invert the y-axis to show the most important feature at the top
plt.gca().invert_yaxis()
plt.show()
<ipython-input-147-490735cf65f8>:33: MatplotlibDeprecationWarning:
Unable to determine Axes to steal space for Colorbar. Using gca(), but will raise in the future. Either provide the *
cax* argument to use as the Axes for the Colorbar, provide the *ax* argument to steal space from it, or add *mappable
* to an Axes.
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 33/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Logistic Regression
In [ ]: # Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
precision_score, recall_score, f1_score, accuracy_score, confusion_matrix,
classification_report, roc_auc_score, roc_curve
)
import matplotlib.pyplot as plt
from sklearn.metrics import matthews_corrcoef
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 34/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Output the evaluation metrics
print("Evaluation Metrics:")
print(f'Accuracy: {accuracy:.2f}')
print(f'Sensitivity: {sensitivity:.2f}')
print(f'Specificity: {specificity:.2f}')
print(f'F1 Score: {f1_score:.2f}')
print(f'AUC-ROC: {roc_auc:.2f}')
print(f'Balanced Accuracy: {balanced_acc:.2f}')
Evaluation Metrics:
Accuracy: 0.63
Sensitivity: 0.68
Specificity: 0.61
F1 Score: 0.53
AUC-ROC: 0.67
Balanced Accuracy: 0.65
Confusion Matrix:
True Negative : 292 False Positive : 183
False Negative : 68 True Positive : 144
Classification Report:
precision recall f1-score support
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 35/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
response_variable = 'death_flag'
predictor_variables = [col for col in standardized_data_reduced_svm.columns if col != response_variable]
class_counts = y_train.value_counts()
total_samples = len(y_train)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 36/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Evaluation Metrics:
Accuracy: 0.66
Sensitivity: 0.70
Specificity: 0.64
Precision: 0.47
Recall: 0.70
F1 Score: 0.56
AUC-ROC: 0.67
Balanced Accuracy: 0.67
Confusion Matrix:
[[304 171]
[ 63 149]]
True Negative : 304 False Positive : 171
False Negative : 63 True Positive : 149
Classification Report:
precision recall f1-score support
XGBoost
In [ ]: # Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, classification_
import matplotlib.pyplot as plt
response_variable = 'death_flag'
predictor_variables = [col for col in standardized_data_reduced.columns if col != response_variable]
class_counts = y_train.value_counts()
total_samples = len(y_train)
# Create and train the XGBoost model with class imbalance handling
xgb_model = XGBClassifier(scale_pos_weight=scale_pos_weight)
xgb_model.fit(X_train, y_train)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 37/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
y_pred = xgb_model.predict(X_test)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 38/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Evaluation Metrics:
Accuracy: 0.68
Sensitivity: 0.48
Specificity: 0.77
Precision: 0.48
Recall: 0.48
F1 Score: 0.48
AUC-ROC: 0.67
Balanced Accuracy: 0.63
Confusion Matrix:
[[366 109]
[110 102]]
True Negative : 366 False Positive : 109
False Negative : 110 True Positive : 102
Classification Report:
precision recall f1-score support
Random Forest
In [ ]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix, precision_score, r
import matplotlib.pyplot as plt
response_variable = 'death_flag'
predictor_variables = [col for col in standardized_data_reduced.columns if col != response_variable]
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 39/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
print(f'F1 Score: {f1_score:.2f}')
print(f'AUC-ROC: {roc_auc:.2f}')
print(f'Balanced Accuracy: {balanced_acc:.2f}')
Evaluation Metrics:
Accuracy: 0.68
Sensitivity: 0.15
Specificity: 0.91
Precision: 0.43
Recall: 0.15
F1 Score: 0.22
AUC-ROC: 0.67
Balanced Accuracy: 0.53
Confusion Matrix:
[[432 43]
[180 32]]
True Negative : 432 False Positive : 43
False Negative : 180 True Positive : 32
Classification Report:
precision recall f1-score support
AUC Comparision
In [ ]: import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, auc
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 40/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Set Seaborn style with a bright color palette
sns.set(style="whitegrid")
sns.set_palette("husl")
Clustering Task
Selected Features
In [ ]: selected_features
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 41/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
['total_vent_stays',
Out[ ]:
'avg_WBC_count',
'avg_bmi',
'age',
'charlson_comorbidity_index',
'Heart Rate',
'Temperature Celsius',
'Alanine Aminotransferase (ALT)',
'Alkaline Phosphatase',
'Anion Gap',
'Basophils',
'Chloride',
'Creatinine',
'Eosinophils',
'Free Calcium']
Data Prepration
In [ ]: cluster_dataframe = preprocessed_cohort.copy()
In [ ]: cluster_dataframe['death_flag'].value_counts()
0 1510
Out[ ]:
1 780
Name: death_flag, dtype: int64
Gender Mapping
In [ ]: gender_mapping = {'M': 0, 'F': 1}
cluster_dataframe['gender'] = cluster_dataframe['gender'].replace(gender_mapping)
<ipython-input-176-874157035c01>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Outlier Removal
In [ ]: from sklearn.ensemble import IsolationForest
# Perform outlier detection and add the 'outlier' column to the DataFrame
standardized_df['outlier'] = iso_forest.predict(standardized_df)
K-Means Clustering
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 42/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Function definition
In [ ]: from sklearn.cluster import KMeans # For K-means clustering
from sklearn.decomposition import PCA # For Principal Component Analysis
from sklearn.metrics import silhouette_score # For Silhouette score
#Defining a function that will perform k_means clustering for different k values
def k_means_clustering(tot_clusters, data, clustering_columns ):
data = data.copy()
return data
#Display the cumulative variance plot that is used to decide the number of PCA components
def cummulative_variance_plot(dataset, columns, title):
# Perform PCA analysis and visualizes first 2 Principal components (2-D visualization)
def plot_pc_analysis(data, num_clusters):
plt.grid(True)
# Display the scatter plot with centroids
plt.show()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 43/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
In [ ]: columns_to_exclude = ['death_flag']
clustering_columns = [col for col in standardized_df.columns if col not in columns_to_exclude]
# Fit the KMeans model to the scaled dataset using selected clustering columns
kmeans.fit(standardized_df[clustering_columns])
# Fit the KMeans model to the scaled dataset using selected clustering columns
kmeans.fit(standardized_df[clustering_columns])
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 44/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
plt.grid(True)
plt.show()
Number of Clusters = 3
Clustering Plot
In [ ]: standardized_data_reduced_km_2 = k_means_clustering(3, standardized_data_reduced, clustering_columns)
<ipython-input-49-b54e9936686e>:116: UserWarning: *c* argument looks like a single numeric RGB or RGBA sequence, whic
h should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the
*color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value
for all points.
plt.scatter(cluster_data['PC1'], cluster_data['PC2'], label=f'Cluster {cluster_id}', c=palette[cluster_id], alpha=
0.7)
<ipython-input-49-b54e9936686e>:116: UserWarning: *c* argument looks like a single numeric RGB or RGBA sequence, whic
h should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the
*color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value
for all points.
plt.scatter(cluster_data['PC1'], cluster_data['PC2'], label=f'Cluster {cluster_id}', c=palette[cluster_id], alpha=
0.7)
<ipython-input-49-b54e9936686e>:116: UserWarning: *c* argument looks like a single numeric RGB or RGBA sequence, whic
h should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the
*color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value
for all points.
plt.scatter(cluster_data['PC1'], cluster_data['PC2'], label=f'Cluster {cluster_id}', c=palette[cluster_id], alpha=
0.7)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 45/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Clustering Analysis
In [ ]: cluster_groups = standardized_data_reduced_km_2.groupby('cluster')
print(mortality_percentage)
print(gender_percentage)
print(bronchoscopy_percentage)
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 46/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
Mortality Percentage by Cluster:
cluster
0 27.054610
1 53.979239
2 34.255913
Name: death_flag, dtype: float64
cluster Mortality Percentage Survived Percentage
0 0 27.054610 72.945390
1 1 53.979239 46.020761
2 2 34.255913 65.744087
cluster Female Percentage Male Percentage
0 0 60.308057 39.691943
1 1 69.550173 30.449827
2 2 59.204840 40.795160
cluster With bronchoscopy Without bronchoscopy
0 0 46.682464 53.317536
1 1 39.792388 60.207612
2 2 49.092481 50.907519
cluster = mortality_percentage['cluster']
mortality_percentage_values = mortality_percentage['Mortality Percentage']
survived_percentage_values = mortality_percentage['Survived Percentage']
bar_width = 0.35
bar_positions = range(len(cluster))
add_percentage_labels(bar1)
add_percentage_labels(bar2)
plt.tight_layout()
plt.savefig('mortality.png', dpi=300, bbox_inches='tight')
# Show the graph
plt.show()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 47/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
add_percentage_labels(axes[0])
add_percentage_labels(axes[1])
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 48/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Linewidth
line_width = 1
# Marker size
marker_size = 6
# Legend
axes.legend(loc='best')
# Grid
axes.grid(True)
# Tight layout
plt.tight_layout()
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 49/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
total_icu_stays total_vent_stays avg_bmi age \
count 1157.000000 1157.000000 1157.000000 1157.000000
mean 2.420916 2.854797 28.275788 69.442304
std 2.379631 2.231344 7.103919 10.329259
min 1.000000 1.000000 0.308642 29.000000
25% 1.000000 1.000000 23.689398 62.000000
50% 2.000000 2.000000 28.061224 70.000000
75% 3.000000 3.000000 31.409788 77.000000
max 25.000000 25.000000 96.604388 91.000000
# Set the style for better readability (you can adjust the style to your preference)
sns.set(style="whitegrid")
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 50/51
28/10/2023, 20:56 COMP90089_JupyterNotebook
# Check if the average values are within the reference range for each cluster
for cluster in average_values_by_cluster['cluster']:
cluster_data = average_values_by_cluster[average_values_by_cluster['cluster'] == cluster]
within_range = {}
for col in columns_to_average:
if col in ref_ranges:
col_lower, col_upper = ref_ranges[col]
#print (ref_ranges[col])
within_range[f'{col}_within_range'] = (cluster_data[col].values[0] > col_lower) and (cluster_data[col].val
else:
within_range[f'{col}_within_range'] = 'null'
result_df.loc[len(result_df)] = [cluster, *list(within_range.values())]
In [ ]: result_df
merged_df
Alanine
cluster Heart Rate Temperature Alkaline Anion Basophils Chloride Free Am
Out[ ]:
Celsius Aminotransferase
(ALT) Phosphatase Gap Creatinine Eosinophils Calcium
(ALT
0 0 105.173990 37.163008 179.142400 120.606253 14.706229 0.195095 104.533380 1.193389 0.919454 1.108535
1 1 94.074211 37.098196 381.340534 140.365630 21.479079 0.188954 98.293658 4.116356 0.929007 1.088096
2 2 84.808210 37.096441 111.705621 106.118779 13.996107 0.230852 104.895903 1.264473 1.124339 1.156892
file:///Users/ritwikgiri/Downloads/COMP90089_JupyterNotebook.html 51/51