Objective: To test the hypothesis that accuracy, discrimination, and precision in predicting postoperative complications improve when using both preoperative and intraoperative data input features versus preoperative data alone. Background: Models that predict postoperative complications often ignore important intraoperative events and physiological changes. Incorporation of intraoperative physiological data may improve model performance. Methods: This retrospective cohort analysis included 43,943 adults undergoing 52,529 inpatient surgeries at a single institution during a five-year period. Random forest machine learning models in the validated MySurgeryRisk platform made patient-level predictions for three postoperative complications and mortality during hospital admission using electronic health record data and patient neighborhood characteristics. For each outcome, one model was trained with preoperative data alone and one model was trained with both preoperative and intraoperative data. Models were compared by accuracy, discrimination (expressed as AUROC: area under the receiver operating characteristic curve), precision (expressed as AUPRC: area under the precision-recall curve), and reclassification indices (NRI). Results: Machine learning models incorporating both preoperative and intraoperative data had greater accuracy, discrimination, and precision than models using preoperative data alone for predicting all three postoperative complications (intensive care unit length of stay >48 hours, mechanical ventilation >48 hours, and neurological complications including delirium) and in-hospital mortality (accuracy: 88% vs. 77%,
With advancements in fields such as computational chemistry, computer-aided molecular design and chemoinformatics, the scientific community has become inundated with a very large set of molecular descriptors. The advantage of having a large set of descriptors available is that computational modelers can now capture different characteristics of molecules of varying sizes in different solvent/reaction media. The drawback is that, during model development, the number of descriptors can exceed the number of instances in a dataset. Such datasets are known as high-dimensional data matrices. This is especially the case when the process of data generation is complex, time-consuming and/or resource intensive. It can also happen when a product needs to be developed for a very specific use (e.g. drugs for a specific physical condition, polymers with a specific property, reactions in a specific environment). These cases tend to be very condition-specific, e.g. in the type of chemical species, the activities or responses in a specific environment, temperature, pressure, etc. The challenges of modeling such cases include, but are not limited to, the difficulty of generating a generalizable model, large model uncertainty and overfitting of the model(s) generated. To address these drawbacks and the ensuing challenges, in this work we have developed hybrid algorithms which are efficient and can generate generalizable models. These algorithms overcome the disadvantage of traditional modeling techniques, which break down when the number of descriptors exceeds the sample size. The developed algorithms can be incorporated in software platforms useful for automated design of product-centric industrial processes.
Such software should be capable of analyzing experimental data and generating the best possible molecular structure for the specific constraints and objectives. It is also required to be fast and accurate at the same time. In the past, such situations were tackled with ab initio calculations, later replaced by DFT (Density Functional Theory) based calculations. Apart from being computationally expensive, such methods involve manual handling of data for molecular design operations. To address such limitations, molecular descriptors (0D-7D) became attractive alternatives. However, the complexity of calculating descriptors increases with the complexity of the molecular structure. 2D (two-dimensional) descriptors, such as connectivity index descriptors, have been proven to be efficient in model generation with significant accuracy. Also, the design calculation steps are not computationally expensive. For these reasons, in this work, the generated models are based on 2D molecular descriptors. In this work, two unique condition-specific situations are discussed. Case 1 encompasses relating reactant and solvent structures to the reaction rate constants for Diels-Alder reactions. As reaction rates are more prone to depend on inter-atom connectivity, connectivity index descriptors were used to develop this model. A hybrid GA-DT (Genetic Algorithm-Decision Tree) algorithm was developed for feature selection and model development. This case is unique as it involves the study of three different chemical species while generating the predictive model, and hence is a challenge for both traditional and newly developed hybrid algorithms. Further improvements to the model were proposed using a Multi-Gene Genetic Programming (MGGP) algorithm to derive non-linear models. Case 2 is based on developing a model to relate the structures of 9-anilinoacridine derivatives to their respective DNA-drug binding affinity values.
Although this case has only one group of chemical species under consideration, challenges emerge when two or more models with similar metrics are generated. Although the genetic algorithm was used for feature selection, initially, a novel adaptive version of the LASSO (Least Absolute Shrinkage and Selection Operator) algorithm was developed. This adaptive correlation-based LASSO

This astonishing journey of the last five years has, in many ways, left me in debt to many amazing persons, both in Auburn and back home. I was fortunate to have the acquaintance of people who admired a poetic soul in STEM; people who saw a philosopher in me and supported that ideology and philosophy while it blossomed through scientific research. Above all, I was fortunate to have the acquaintance of people who cared to know, understand, and be part of this journey, making these five years the best part of my life so far. First and foremost, my heartfelt gratitude goes to Dr. Mario R. Eden, my thesis supervisor. None can ask for a better supervisor. This part of my life has been very enjoyable due to his guidance, friendship, and endless support. He has provided me opportunities not only to pursue my passion for research, but also to present my work and connect with my peers both in the US and abroad. Nothing can be truer than the fact that the advisor, particularly while pursuing a doctorate, can make all the difference in growing as a researcher, and a wondering soul. Dr. Eden did everything humanly possible to provide me with the environment to grow as a researcher, a professional, a problem solver, and a seeker of knowledge. For that, I am forever indebted to his gracious nature. I would also like to take the opportunity to acknowledge the support provided by my committee members, Dr. Allan David, Dr. Elizabeth Lipke, and Dr. Thomas Gallagher.
Their feedback, support, and constant interest in my work have been of utmost importance for the development of my dissertation, as well as the quality of my work. To add, I have been most fortunate to have the following people as my peers and groupmates. The list includes, but
Computer-Aided Molecular Design (CAMD) has been used to supplement and guide experimental efforts in various fields. Currently, highly predictive models can be developed as a result of improved computing hardware along with novel methodologies for model development. Decreasing dependency on experimental efforts can also reduce the environmental footprint of companies involved in product design and development. Although it is a well-practiced process, crystallization is a comparatively less studied separation unit operation. Not much is known about the variables that affect the quality and usefulness of the end product gained from crystallization. Studies suggest that the interactions between solute and solvent are quite difficult to quantify, and large variance can be noticed with every solute-solvent combination. Pharmaceutical companies are widely known for using crystallization operations to develop their end product. Additionally, crystal morphology has a significant impact on how...
Accurate prediction of postoperative complications can inform shared decisions between patients and surgeons regarding the appropriateness of surgery, preoperative risk-reduction strategies, and postoperative resource use. Traditional predictive analytic tools are hindered by suboptimal performance and usability. We hypothesized that novel deep learning techniques would outperform logistic regression models in predicting postoperative complications. In a single-center longitudinal cohort of 43,943 adult patients undergoing 52,529 major inpatient surgeries, deep learning yielded greater discrimination than logistic regression for all nine complications. Predictive performance was strongest when leveraging the full spectrum of preoperative and intraoperative physiologic time-series electronic health record data. A single multi-task deep learning model yielded greater performance than separate models trained on individual complications. Integrated gradients interpretability mechanisms ...
Abstract: The design of molecular solvents has garnered significant interest because solvents have been shown to influence the rate of chemical product generation in a reaction. To quantitatively understand the influence of solvent structure on the rate of the reaction, models are needed that capture this influence, in addition to that of the reactants' structure, on the rate constant. A quantitative structure-property relationship (QSPR) for the Diels-Alder reaction was recently developed using a hybrid genetic algorithm-decision tree (GA-DT) approach. However, there is still scope for improvement in the performance of the QSPR. To improve upon the performance of this QSPR, we have assessed various tree-based ensemble machine learning regression methods for prediction of the rate constant (modeled using connectivity indices) of the Diels-Alder reaction. The assessed methods are random forest regression, gradient boosted regression trees, regularized random forest regression and extremely randomized trees. The evaluation was carried out in terms of the R2 and Q2 values. Extremely randomized trees were found to provide the highest R2 value of 0.91, while random forests provided the highest Q2 value of 0.76.
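The R2 and Q2 metrics named in the abstract above can be illustrated with a minimal sketch. It assumes the common conventions R2 = 1 - SS_res/SS_tot on the fitted data and Q2 = 1 - PRESS/SS_tot under leave-one-out cross-validation; the abstract does not state which Q2 variant was used. A one-descriptor least-squares line stands in for the tree ensembles purely to keep the example self-contained, and the function names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(xs, ys):
    """Goodness of fit on the training data: 1 - SS_res / SS_tot."""
    a, b = fit_line(xs, ys)
    my = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def q_squared_loo(xs, ys):
    """Leave-one-out Q2: each point is predicted by a model fitted without it.
    Assumed convention: 1 - PRESS / SS_tot, with SS_tot from the full data."""
    my = mean(ys)
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - press / ss_tot

# Made-up illustrative data, not taken from the study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
r2, q2 = r_squared(xs, ys), q_squared_loo(xs, ys)
```

Because every held-out point is predicted by a model that never saw it, Q2 computed this way cannot exceed R2 for a least-squares fit, which is why a gap between the two (such as 0.91 versus 0.76) is often read as a sign of overfitting.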
BACKGROUND Postoperative acute kidney injury is common after major vascular surgery and is associated with increased morbidity, mortality, and cost. High-performance risk stratification using a machine learning model can inform strategies that mitigate harm and optimize resource use. It is hypothesized that incorporating intraoperative data would improve machine learning model accuracy, discrimination, and precision in predicting acute kidney injury among patients undergoing major vascular surgery. METHODS A single-center retrospective cohort of 1,531 adult patients who underwent nonemergency major vascular surgery, including open aortic, endovascular aortic, and lower extremity bypass procedures, was evaluated. The validated, automated MySurgeryRisk analytics platform used electronic health record data to forecast patient-level probabilistic risk scores for postoperative acute kidney injury using random forest models with preoperative data alone and perioperative data (preoperative plus intraoperative). The MySurgeryRisk predictions were compared with each other as well as with the American Society of Anesthesiologists physical status classification. RESULTS Machine learning models using perioperative data had greater accuracy, discrimination, and precision than models using either preoperative data alone or the American Society of Anesthesiologists physical status classification (accuracy: 0.70 vs 0.64 vs 0.62, area under the receiver operating characteristics curve: 0.77 vs 0.68 vs 0.61, area under the precision-recall curve: 0.70 vs 0.58 vs 0.48).
CONCLUSION In predicting acute kidney injury after major vascular surgery, machine learning approaches that incorporate dynamic intraoperative data had greater accuracy, discrimination, and precision than models using either preoperative data alone or the American Society of Anesthesiologists physical status classification. Machine learning methods have the potential for real-time identification of high-risk patients who may benefit from personalized risk-reduction strategies.
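As a rough illustration of the discrimination and precision metrics reported above, the sketch below computes AUROC as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, and estimates AUPRC by average precision. These are standard definitions, but the function names and the toy labels and scores are assumptions for illustration; nothing here reproduces the MySurgeryRisk pipeline.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.
    Counts the fraction of (positive, negative) pairs ranked correctly;
    tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve:
    the mean of precision-at-rank over the ranks where a positive occurs.
    Tied scores are not given special treatment in this sketch."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return total / hits

# Made-up example: 1 = complication occurred, scores = predicted risk.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

AUROC is insensitive to how rare the outcome is, whereas the precision-recall view depends on prevalence, which is one reason studies of uncommon complications tend to report both.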
Abstract: With major advancements in the field of machine learning and readily available, inexpensive computational power, QSPRs (quantitative structure-property relationships) are increasingly being viewed by the scientific community as reliable tools that can provide accurate property prediction. Additionally, QSPRs offer the advantages of reduced experimental cost and a reduced chemical footprint associated with experiments. Treatment of cancerous tumors has become a global focus due to the heightened prevalence of such tumors in humans, both young and old. Apart from surgery, the most commonly used treatment is chemotherapy. As there are many long-term side effects of chemotherapy, such as organ damage, fatigue, hair loss and tooth loss, researchers are devoting much attention to the search for treatments with fewer side effects. So far, no effective solution has emerged which can be reported as an alternative to chemotherapy. In a recent study, thirty-one (31) 9-anilinoacridines were synthesized and evaluated for their antitumor activity. The association constant, K, was utilized as a key determining factor to evaluate the DNA-drug binding affinity. 9-anilinoacridines show great promise as antitumor agents. To help reduce the experimental effort of determining K and to assist in the design of 9-anilinoacridines, in this work we developed a QSPR to predict K. To develop the QSPR, all the structures were drawn and optimized using the Avogadro software and converted to mol files. The Dragon 6 software was then used to calculate the values of descriptors from the generated mol files. The descriptors were then used to develop the model using GA (genetic algorithm) and CorrLASSO (correlation-based adaptive least absolute shrinkage and selection operator).
CorrLASSO in combination with GA helped generate a model with superior prediction as compared with the combination of GA and LASSO (least absolute shrinkage and selection operator) and with GA-MLR (genetic algorithm-multiple linear regression). In our work, R2, Q2 and MSE (mean squared error) calculations have been performed to assess model performance and data fitness.
Patients and physicians make essential decisions regarding diagnostic and therapeutic interventions. These actions should be performed or deferred under time constraints and uncertainty regarding patients' diagnoses and predicted response to treatment. This may lead to cognitive and judgment errors. Reinforcement learning is a subfield of machine learning that identifies a sequence of actions to increase the probability of achieving a predetermined goal. Reinforcement learning has the potential to assist in surgical decision making by recommending actions at predefined intervals and by utilizing complex input data, including text, image, and temporal data, in the decision-making process. The algorithm mimics a human trial-and-error learning process to calculate optimum recommendation policies. The article provides insight regarding challenges in the development and application of reinforcement learning in the medical field, with an emphasis on surgical decision making. The review focuses on challenges in formulating a reward function describing the ultimate goal and in determining patient states derived from electronic health records, along with the lack of resources to simulate the potential benefits of suggested actions in response to changing physiological states during and after surgery. Although clinical implementation would require secure, interoperable, livestreaming electronic health record data for use by virtual models, development and validation of personalized reinforcement learning models in surgery can contribute to improving care by helping patients and clinicians make better decisions.
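The idea of learning a sequence of actions from rewards can be made concrete with a toy sketch of tabular Q-learning. Everything in it is hypothetical: a four-state chain in which only reaching the last state is rewarded stands in for a vastly more complex clinical setting, and the update is applied to every state-action pair in turn rather than through the trial-and-error exploration a real agent would use, so that the result is deterministic.

```python
GOAL = 3           # terminal state of a hypothetical 4-state chain (states 0..3)
ACTIONS = (0, 1)   # 0 = step left, 1 = step right

def step(state, action):
    """Deterministic toy environment: reward 1.0 only on reaching GOAL."""
    nxt = state + 1 if action == 1 else max(0, state - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def q_learning(alpha=0.5, gamma=0.9, sweeps=200):
    """Apply the Q-learning update
        Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    to every non-terminal state-action pair, repeated for `sweeps` passes."""
    q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                nxt, reward, done = step(s, a)
                target = reward + (0.0 if done else gamma * max(q[nxt]))
                q[s][a] += alpha * (target - q[s][a])
    return q

def greedy_policy(q):
    """Recommended action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]

q_table = q_learning()
policy = greedy_policy(q_table)
```

The reward function here is trivial to write down; the review's point is that in surgery it is not, because the "goal" must be expressed numerically and patient states must be derived from electronic health record data.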
Abstract: Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on reaction rate is of significance. Such QSPR models will serve as a prerequisite for the simultaneous computer-aided molecular design (CAMD) of reactants, products and solvents. They will also be useful in predicting the rate constant without entirely relying on experiments. To develop such a QSPR, recently, Datta et al. (2017) used the Diels-Alder reaction as a case study. Their model displayed great promise, but there is scope for improvement in the model's prediction metrics. In our work, we improve upon their model by introducing non-linearity. This is achieved using multi-gene genetic programming (MGGP). In our methodology, a combination of genetic algorithm (GA) and directed trees was used to develop a branched version of chromosomes, allowing increased possibility of generation of models with high prediction metrics. In our work, prior to model development through MGGP, principal component analysis (PCA) was conducted. Lastly, models were evaluated based on metrics such as R2, Q2, and RMSE.
Abstract: In recent years, Computer-Aided Molecular Design (CAMD) has been extensively used for defining and designing reactions at their maximal potential. In all of these contributions, either the structures of the reactants/products or the structure of the solvent has been held fixed. Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on reaction rate is essential. Since the structures of reactants and products are related, such QSPR models will serve as a prerequisite for the simultaneous CAMD of reactants, products and solvents. They will also provide a useful tool for predicting the rate constant without relying on experiments. To develop such a QSPR, in our work, the Diels-Alder reaction with different sets of reactants and solvents was investigated. Connectivity indices were used to represent the structures of the members of each set. Principal Component Analysis (PCA) was applied to identify principal components (PCs) corresponding to the structures of the reactants and solvent of each set. Linear models expressed in terms of PCs were then generated using a Decision Tree (DT) algorithm such that the R2 value was maximized. These models formed the initial population on which the GA performed operations such as crossover and mutation to obtain the model(s) with the best rate constant prediction. Thus, the novelty of our approach is that after feature extraction using PCA, a DT algorithm generates an ensemble of linear models, which the GA then transforms into a model with the best fit. Our approach required far fewer generations to provide a model with the highest R2ext value as compared to the case where the DT did not initialize the population of models.
Abstract: In the United States, cancer is the second leading cause of death. Worldwide too, cancer is a major health problem. Hence, treatment of cancerous tumors remains a matter of very high concern. Apart from surgical treatment, the most commonly employed treatment is chemotherapy. However, due to long-term side effects such as organ damage and loss of teeth, doctors and patients are interested in treatments with reduced side effects. So far, a reasonably acceptable alternative to chemotherapy has not emerged. Recently, 9-anilinoacridines were evaluated as potential antitumor agents due to their enhanced tendency for DNA binding. For an initial evaluation of drug performance, the association constant, K, is considered to be the key DNA-drug binding property. In our work, to reduce experimental efforts and the associated chemical footprint, we develop a QSPR to model K. To model K, we utilized descriptors requiring representation of molecular structures in two dimensions or fewer. To establish a relationship between the descriptors and K, we have developed a correlation-based adaptive LASSO algorithm (CorrLASSO). CorrLASSO, like LASSO (least absolute shrinkage and selection operator), incorporates feature selection as part of the learning procedure. It is also useful for dealing with high-dimensional data. As an improvement, CorrLASSO evaluates the correlation between descriptors/features and the dependent property to generate a model with high performance metrics. In our work, R2, Q2 and MSE (mean squared error) were utilized as performance metrics.
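How LASSO performs feature selection as part of fitting can be shown with a minimal coordinate-descent sketch. It minimizes (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| using soft-thresholding, which sets weak coefficients exactly to zero. The per-descriptor penalty weights w_j are an assumption included only to suggest how an adaptive variant might penalize descriptors unequally (for example, based on their correlation with the property); the abstract does not specify how CorrLASSO derives or uses correlations, so this is not the authors' algorithm. Columns of X and the vector y are assumed to be centered, and all names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(X, y, lam, weights=None, n_iter=200):
    """Cyclic coordinate descent for the weighted LASSO objective above.
    X is a list of rows; no intercept is fitted (data assumed centered)."""
    n, p = len(X), len(X[0])
    w = weights or [1.0] * p
    b = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (all-zero after centering) column carries no signal
            # Covariance of descriptor j with the residual that excludes its own term.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            b[j] = soft_threshold(rho, lam * w[j]) / col_sq[j]
    return b

# Made-up data: y depends only on the first descriptor (y = 2 * x1).
X = [[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-2.0, -2.0, 2.0, 2.0]
coef = weighted_lasso(X, y, lam=0.5)
```

In this example the irrelevant second descriptor receives a coefficient of exactly zero, while the relevant one is shrunk from 2 to 1.5 by the penalty; that shrink-and-select behavior is what makes LASSO-type methods usable when descriptors outnumber samples.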
Abstract: In recent years, Computer-Aided Molecular Design (CAMD) has been extensively used for defining and designing reactions at their maximal potential. In all these contributions, either the reactants/products or the solvents were held constant. Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on reaction rate is essential. Since the structures of reactants and products are related, such QSPR models will serve as a prerequisite for the simultaneous CAMD of reactants, products and solvents. They will also provide a useful tool for predicting the rate constant without relying on experiments. To develop such a QSPR, in our work, the Diels-Alder reaction with different sets of reactants and solvents was investigated. Connectivity indices were used to represent the structures of the members of each set. Principal Component Analysis (PCA) was applied to identify the principal components (PCs) of each set for further use in model development. These PCs were then used to develop a linear model that best predicts the reaction rates in our study. In this paper, the Genetic Algorithm (GA) has been modified using the Decision Tree (DT) algorithm for increased efficiency. Inclusion of the DT in the GA ensures an initial generation of meaningful combinations of descriptors. This set is further improved in every step of crossover and mutation with applied constraints. Owing to these constraints, a new generation is accepted only if it improves on the previous one. Finally, Multiple Linear Regression (MLR) relates the chosen descriptors to the property under study. The model undergoes thorough internal and external validation to ensure that a best-fit model can be found in the minimum number of steps possible.
Abstract: This contribution outlines the development of a quantitative structure-property relationship (QSPR) that relates solvent structure to the morphology of ibuprofen crystals grown within that solvent. Morphology can be quantified by aspect ratio, and ibuprofen aspect ratio data were obtained for crystals grown in 16 different organic solvents. A combination of 2D and 3D molecular descriptors was calculated to provide a quantitative representation of the geometry-optimized solvent molecules. Empirical force fields were used to estimate the three-dimensional structure of the solvent molecules; three different force fields were implemented and their effect on the developed models was analyzed. The descriptor data matrix, containing a multitude of descriptor types, was reduced in size using Bayesian Information Criterion (BIC) methods and Principal Component Analysis (PCA), for regression into linear models through Principal Component Regression. The predictive capabilities of these models were also analyzed by means of internal and external validation methods.
Background: Models that predict postoperative complications often ignore important intraoperative events and physiological changes. This study tested the hypothesis that accuracy, discrimination, and precision in predicting postoperative complications would improve when using both preoperative and intraoperative input data compared with preoperative data alone. Methods: This retrospective cohort analysis included 43,943 adults undergoing 52,529 inpatient surgeries at a single institution during a five-year period. Random forest machine learning models in the validated MySurgeryRisk platform made patient-level predictions for seven postoperative complications and mortality occurring during hospital admission using electronic health record data and patient neighborhood characteristics. For each outcome, one model was trained with preoperative data alone and one model was trained with both preoperative and intraoperative data. Models were compared by accuracy, discrimination (expressed as AUROC: area under the receiver operating characteristic curve), precision (expressed as AUPRC: area under the precision-recall curve), and reclassification indices. Results: Machine learning models incorporating both preoperative and intraoperative data had greater accuracy, discrimination, and precision than models using preoperative data alone for
Objective: To test the hypothesis that accuracy, discrimination, and precision in predicting post... more Objective: To test the hypothesis that accuracy, discrimination, and precision in predicting postoperative complications improve when using both preoperative and intraoperative data input features versus preoperative data alone. Background: Models that predict postoperative complications often ignore important intraoperative events and physiological changes. Incorporation of intraoperative physiological data may improve model performance. Methods: This retrospective cohort analysis included 43,943 adults undergoing 52,529 inpatient surgeries at a single institution during a five-year period. Random forest machine learning models in the validated MySurgeryRisk platform made patient-level predictions for three postoperative complications and mortality during hospital admission using electronic health record data and patient neighborhood characteristics. For each outcome, one model trained with preoperative data alone and one model trained with both preoperative and intraoperative data. Models were compared by accuracy, discrimination (expressed as AUROC: area under the receiver operating characteristic curve), precision (expressed as AUPRC: area under the precision-recall curve), and reclassification indices (NRI). Results: Machine learning models incorporating both preoperative and intraoperative data had greater accuracy, discrimination, and precision than models using preoperative data alone for predicting all three postoperative complications (intensive care unit length of stay >48 hours, mechanical ventilation >48 hours, and neurological complications including delirium) and in-hospital mortality (accuracy: 88% vs. 77%,
With advancements in fields such as computational chemistry, computer-aided molecular design and ... more With advancements in fields such as computational chemistry, computer-aided molecular design and chemoinformatics, the scientific community has now become inundated with a very large set of molecular descriptors. The advantage of availability of large set of descriptors is that computational modelers can now capture different characteristics of molecules of varying sizes in different solvent/reaction mediums. However, the drawback is that during model development, the number of descriptors can exceed the number of instances in a dataset. Such datasets are known as high-dimensional data matrix. This is especially the case when the process of data generation is complex, time-consuming and/or resource intensive. Apart from these reasons, this can also happen when a specific product needs to be developed for a very specific use (e.g. drugs for a specific physical condition, polymers of a specific property, reaction in a specific environment). These cases tend to be very condition-specific, e.g. type of chemical species, activities or responses in specific environment, temperature, pressure, etc. The challenges of modeling such cases include but are not limited to; difficulty of generating a generalizable model, large model uncertainty and overfitting of model(s) generated. To address the aforementioned drawbacks and ensuing challenges, in this work, we have developed hybrid algorithms which are efficient and can generate generalizable models. These algorithms overcome the disadvantage of traditional modeling techniques that break down when the number of descriptors exceed the sample size. The developed algorithms, in our work, can be incorporated in software platforms, useful for automated design of product-centric industrial processes. 
Such software should be capable of analyzing experimental data and generating the best possible molecular structure for the specific constraints and objectives. It is also required to iii be fast and accurate at the same time. In the past, such situations were tackled with ab initio calculations, later replaced by DFT (Density Function Theory) based calculations. Apart from being computationally expensive, such methods include problems of manual handling of data for molecular design operations. To address such limitations, molecular descriptors (0D-7D) became attractive alternatives. However, the complexity of the calculation of descriptors increases with the complexity of the molecular structure. 2D (2 dimensional) descriptors, such as connectivity index descriptors, have been proven to be efficient in model generation with significant accuracy. Also, the design calculation steps are not computationally expensive. For these reasons, in this work, the generated models are based on 2D molecular descriptors. In this work, two unique condition-specific situations have been discussed. Case 1 encompasses relating reactant and solvent structures to the reaction rate constants for Diels Alder reactions. As reaction rates are more prone to depend of inter-atom connectivity, connectivity index descriptors were used to develop this model. A hybrid GA-DT (Genetic Algorithm-Decision Tree) algorithm was developed to select features and for model development. This case is unique as it involves the study of three different chemical species while generating the predictive model, and hence a challenge for both traditional and newly developed hybrid algorithms. Further improvements for the model were proposed using Multi-Gene Genetic Programming (MGGP) algorithm to derive non-linear models. Case 2 is based on developing a model to relate structures of 9-Anilinoacridine derivatives with respective DNA-drug binding affinity values. 
Although this case has only one group of chemical species under consideration, challenges emerge when two or more models with similar metrics are generated. Although the genetic algorithm was used for feature selection, initially, a novel adaptive version of the LASSO (Least Absolute Shrinkage and Selection Operator) algorithm was developed. This adaptive correlation-based LASSO
This astonishing journey of the last five years has, in many ways, occurred in debt to many amazing persons, both in Auburn and back home. I was fortunate to be acquainted with people who admired a poetic soul in STEM; people who saw a philosopher in me and supported that ideology and philosophy while it blossomed through scientific research. Above all, I was fortunate to be acquainted with people who cared to know, understand, and be part of this journey, making these five years the best part of my life so far. First and foremost, my heartfelt gratitude goes to Dr. Mario R. Eden, my thesis supervisor. No one could ask for a better supervisor. This part of my life has been very enjoyable due to his guidance, friendship, and endless support. He has provided me with opportunities not only to pursue my passion for research, but also to present my work and connect with my peers both in the US and abroad. Nothing can be truer than the fact that the advisor, particularly while pursuing a doctorate, can make all the difference in growing as a researcher and a wondering soul. Dr. Eden did everything humanly possible to provide me with the environment to grow as a researcher, a professional, a problem solver, and a seeker of knowledge. For that, I am forever in debt to his gracious nature. I would also like to take the opportunity to acknowledge the support provided by my committee members, Dr. Allan David, Dr. Elizabeth Lipke, and Dr. Thomas Gallagher.
Their feedback, support, and constant interest in my work have been of utmost importance for the development of my dissertation, as well as the quality of my work. In addition, I have been most fortunate to have the following people as my peers and groupmates. The list includes, but
Computer Aided Molecular Design (CAMD) has been used to supplement and guide experimental efforts in various fields. Currently, highly predictive models can be developed as a result of improved computing hardware along with novel methodologies for model development. Decreasing dependency on experimental efforts can also reduce the environmental footprint of companies involved in product design and development. Although it is a well-practiced process, crystallization is comparatively a much less studied separation unit operation. Not much is known about the variables that affect the quality and usefulness of the end product gained from crystallization. Studies suggest that the interactions between solute and solvent are quite difficult to quantify and that large variance can be noticed with every solute-solvent combination. Pharmaceutical companies are widely known for using crystallization operations to develop their end product. Additionally, crystal morphology has a significant impact on how...
Accurate prediction of postoperative complications can inform shared decisions between patients and surgeons regarding the appropriateness of surgery, preoperative risk-reduction strategies, and postoperative resource use. Traditional predictive analytic tools are hindered by suboptimal performance and usability. We hypothesized that novel deep learning techniques would outperform logistic regression models in predicting postoperative complications. In a single-center longitudinal cohort of 43,943 adult patients undergoing 52,529 major inpatient surgeries, deep learning yielded greater discrimination than logistic regression for all nine complications. Predictive performance was strongest when leveraging the full spectrum of preoperative and intraoperative physiologic time-series electronic health record data. A single multi-task deep learning model yielded greater performance than separate models trained on individual complications. Integrated gradients interpretability mechanisms ...
Abstract The design of molecular solvents has garnered significant interest because solvents have been shown to influence the rate of chemical product generation in a reaction. In order to quantitatively understand the influence of solvent structure on the rate of the reaction, models are needed that capture this influence, in addition to that of the reactants’ structure, on the rate constant. A quantitative structure-property relationship (QSPR) for the Diels-Alder reaction was recently developed using a hybrid genetic algorithm-decision tree (GA-DT) approach. However, there is still scope for improvement in the performance of the QSPR. In an attempt to improve upon the performance of this QSPR, we have assessed various tree-based ensemble machine learning regression methods for prediction of the rate constant (modeled using connectivity indices) of the Diels-Alder reaction. The assessed methods are random forest regression, gradient boosted regression trees, regularized random forest regression, and extremely randomized trees. The evaluation was carried out in terms of the R2 and Q2 values. Extremely randomized trees were found to provide the highest R2 value of 0.91, while random forests provided the highest Q2 value of 0.76.
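The abstract above reports R2 and Q2 without defining them. As a minimal sketch, assuming Q2 denotes the leave-one-out cross-validated R2 (a common QSPR convention that the source does not confirm), the two metrics could be computed for a one-descriptor linear model as follows. The function names and the toy data are illustrative only and do not come from the paper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(xs, ys):
    """R2: fit on all points, score on the same points."""
    a, b = fit_line(xs, ys)
    my = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def q_squared_loo(xs, ys):
    """Q2 (assumed definition): 1 - PRESS / SS_tot, where PRESS sums the
    squared errors of predicting each point from a fit on all the others."""
    my = mean(ys)
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Made-up toy data: one descriptor value per molecule and a property value.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prop = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
print("R2 =", r_squared(descriptor, prop))
print("Q2 =", q_squared_loo(descriptor, prop))
```

For an ordinary least-squares fit, Q2 computed this way cannot exceed R2, which is one reason the two metrics can rank models differently, as in the abstract.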
BACKGROUND Postoperative acute kidney injury is common after major vascular surgery and is associated with increased morbidity, mortality, and cost. High-performance risk stratification using a machine learning model can inform strategies that mitigate harm and optimize resource use. It is hypothesized that incorporating intraoperative data would improve machine learning model accuracy, discrimination, and precision in predicting acute kidney injury among patients undergoing major vascular surgery. METHODS A single-center retrospective cohort of 1,531 adult patients who underwent nonemergency major vascular surgery, including open aortic, endovascular aortic, and lower extremity bypass procedures, was evaluated. The validated, automated MySurgeryRisk analytics platform used electronic health record data to forecast patient-level probabilistic risk scores for postoperative acute kidney injury using random forest models with preoperative data alone and perioperative data (preoperative plus intraoperative). The MySurgeryRisk predictions were compared with each other as well as with the American Society of Anesthesiologists physical status classification. RESULTS Machine learning models using perioperative data had greater accuracy, discrimination, and precision than models using either preoperative data alone or the American Society of Anesthesiologists physical status classification (accuracy: 0.70 vs 0.64 vs 0.62, area under the receiver operating characteristic curve: 0.77 vs 0.68 vs 0.61, area under the precision-recall curve: 0.70 vs 0.58 vs 0.48).
CONCLUSION In predicting acute kidney injury after major vascular surgery, machine learning approaches that incorporate dynamic intraoperative data had greater accuracy, discrimination, and precision than models using either preoperative data alone or the American Society of Anesthesiologists physical status classification. Machine learning methods have the potential for real-time identification of high-risk patients who may benefit from personalized risk-reduction strategies.
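The three comparison metrics named in this abstract (accuracy, area under the receiver operating characteristic curve, and area under the precision-recall curve) can be illustrated with a small sketch. It assumes binary outcome labels and per-patient risk scores, uses a 0.5 decision threshold for accuracy, and estimates the precision-recall area by average precision; none of these choices is stated in the source. The function names and the scores are made up for illustration and are not study data.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of cases where the thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

def auroc(y_true, scores):
    """Probability that a random positive case outscores a random
    negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision, a step-wise estimate of the area under the
    precision-recall curve. Tied scores are ordered arbitrarily here."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits

# Hypothetical labels (1 = complication) and two hypothetical score sets.
y_true = [0, 0, 1, 1]
preop_scores = [0.10, 0.40, 0.35, 0.80]
periop_scores = [0.10, 0.30, 0.60, 0.90]
for name, scores in [("preop", preop_scores), ("periop", periop_scores)]:
    print(name, accuracy(y_true, scores), auroc(y_true, scores),
          auprc(y_true, scores))
```

Unlike accuracy, the two area metrics do not depend on a single threshold. The precision-recall area is also sensitive to how rare the outcome is, which is presumably why the abstract reports it alongside discrimination.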
Abstract With major advancements in the field of machine learning and readily available, inexpensive computational power, QSPRs (quantitative structure-property relationships) are increasingly being viewed by the scientific community as reliable tools that can provide accurate property prediction. Additionally, QSPRs offer the advantages of reduced experimental cost and a reduced chemical footprint associated with experiments. Treatment of cancerous tumors has become a global focus due to the heightened prevalence of such tumors in humans, both young and old. Apart from surgery, the most commonly used treatment is chemotherapy. As there are many long-term side effects of chemotherapy, such as organ damage, fatigue, hair loss, and tooth loss, researchers are devoting much attention to the search for treatments with fewer side effects. So far, no effective solution has emerged that can be reported as an alternative to chemotherapy. In a recent study, thirty-one (31) 9-anilinoacridines were synthesized and evaluated for their antitumor activity. The association constant, K, was utilized as a key determining factor to evaluate the DNA-drug binding affinity. 9-Anilinoacridines show great promise as antitumor agents. In order to help reduce the experimental effort of K value determination and to assist in the design of 9-anilinoacridines, in this work, we developed a QSPR to predict K. In order to develop the QSPR, all the structures were drawn and optimized using the Avogadro software and converted to mol files. The Dragon 6 software was then used to calculate the values of descriptors using the generated mol files. The descriptors were then used to develop the model using GA (genetic algorithm) and CorrLASSO (correlation-based adaptive least absolute shrinkage and selection operator).
The CorrLASSO in combination with GA helped generate a model with superior prediction compared with the combination of GA and LASSO (least absolute shrinkage and selection operator) and with GA-MLR (genetic algorithm-multiple linear regression). In our work, R2, Q2, and MSE (mean squared error) calculations have been performed to assess model performance and data fitness.
Patients and physicians make essential decisions regarding diagnostic and therapeutic interventions. These actions should be performed or deferred under time constraints and uncertainty regarding patients' diagnoses and predicted response to treatment. This may lead to cognitive and judgment errors. Reinforcement learning is a subfield of machine learning that identifies a sequence of actions to increase the probability of achieving a predetermined goal. Reinforcement learning has the potential to assist in surgical decision making by recommending actions at predefined intervals and through its ability to utilize complex input data, including text, image, and temporal data, in the decision-making process. The algorithm mimics a human trial-and-error learning process to calculate optimum recommendation policies. The article provides insight regarding challenges in the development and application of reinforcement learning in the medical field, with an emphasis on surgical decision making. The review focuses on challenges in formulating a reward function that describes the ultimate goal and in determining patient states derived from electronic health records, along with the lack of resources to simulate the potential benefits of suggested actions in response to changing physiological states during and after surgery. Although clinical implementation would require secure, interoperable, livestreaming electronic health record data for use by a virtual model, development and validation of personalized reinforcement learning models in surgery can contribute to improving care by helping patients and clinicians make better decisions.
Abstract Developing a QSPR model, which not only captures the influence of reactant structures but also the solvent effect on reaction rate, is of significance. Such QSPR models will serve as a prerequisite for the simultaneous computer-aided molecular design (CAMD) of reactants, products and solvents. They will also be useful in predicting the rate constant without entirely relying on experiments. To develop such a QSPR, recently, Datta et al. (2017) used the Diels-Alder reaction as a case study. Their model displayed great promise, but there is scope for improvement in the model's prediction metrics. In our work, we improve upon their model by introducing non-linearity. This is achieved using multi-gene genetic programming (MGGP). In our methodology, a combination of genetic algorithm (GA) and directed trees was used to develop a branched version of chromosomes, allowing increased possibility of generation of models with high prediction metrics. In our work, prior to model development through MGGP, principal component analysis (PCA) was conducted. Lastly, models were evaluated based on metrics such as R2, Q2, and RMSE.
Abstract In recent years, Computer-Aided Molecular Design (CAMD) has been extensively used for defining and designing reactions at their maximal potential. In all of these contributions, either the structures of the reactants/products or the structure of the solvent has been considered to be unchanging. Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on reaction rate is essential. Since the structures of reactants and products are related, such QSPR models will serve as a prerequisite for the simultaneous CAMD of reactants, products and solvents. They will also provide a useful tool for predicting the rate constant without relying on experiments. To develop such a QSPR, in our work, the Diels-Alder reaction with different sets of reactants and solvents was investigated. Connectivity indices were used to represent the structures of the members of each set. Principal Component Analysis (PCA) was applied to identify principal components (PCs) corresponding to the structures of reactants and solvent of each set. Linear models expressed in terms of PCs were then generated using a Decision Tree (DT) algorithm such that the R2 value was maximized. These models formed the initial population on which the GA performed operations such as crossover and mutation to obtain the model(s) with the best rate constant prediction. Thus, the novelty of our approach is that after feature extraction using PCA, a DT algorithm generates an ensemble of linear models, which the GA transforms into a model with the best fit. Our approach required far fewer generations to provide a model with the highest R2ext value than the case where the DT did not initialize the population of models.
Abstract In the United States, cancer is the second leading cause of death. Worldwide too, cancer is a major health problem. Hence, treatment of cancerous tumors remains a matter of very high concern. Apart from surgical treatment, the most commonly employed treatment is chemotherapy. However, due to long-term side effects such as organ damage and loss of teeth, doctors and patients are interested in treatments with reduced side effects. So far, a reasonably acceptable alternative to chemotherapy has not emerged. Recently, 9-anilinoacridines were evaluated as potential antitumor agents due to their enhanced tendency for DNA binding. For an initial evaluation of drug performance, the association constant, K, is considered to be the key DNA-drug binding property. In our work, to reduce experimental efforts and the associated chemical footprint, we develop a QSPR to model K, utilizing descriptors that require representation of molecular structures in two dimensions or less. To establish a relationship between the descriptors and K, we have developed a correlation-based adaptive LASSO algorithm (CorrLASSO). CorrLASSO, like LASSO (least absolute shrinkage and selection operator), incorporates feature selection as part of the learning procedure. It is also useful for dealing with high-dimensional data. As an improvement, CorrLASSO evaluates the correlation between descriptors/features and the dependent property to generate a model with high performance metrics. In our work, R2, Q2, and MSE (mean squared error) were utilized as performance metrics.
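To make concrete how a LASSO-type method performs feature selection during fitting, here is a minimal coordinate-descent sketch with an optional per-descriptor penalty weight. The abstract does not give the CorrLASSO update rule, so the correlation-based weighting below (penalty weight inversely proportional to the absolute Pearson correlation between a descriptor and the property) is an assumption made purely for illustration and should not be read as the authors' algorithm. All names and data are hypothetical, and the columns are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(value, bound):
    """Shrink value toward zero by bound; exactly zero inside [-bound, bound]."""
    if value > bound:
        return value - bound
    if value < -bound:
        return value + bound
    return 0.0

def weighted_lasso(X, y, lam, weights=None, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam * sum_j w_j*|b_j| by cyclic
    coordinate descent. X is a list of rows."""
    n, p = len(X), len(X[0])
    weights = weights or [1.0] * p
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam * weights[j]) / z
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
                beta[j] = new
    return beta

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0.0 or syy == 0.0 else sxy / (sxx * syy) ** 0.5

def corr_weights(X, y, eps=1e-6):
    """Hypothetical weighting: weakly correlated descriptors are penalised more."""
    p = len(X[0])
    return [1.0 / (abs(pearson([row[j] for row in X], y)) + eps) for j in range(p)]

# Made-up data with more descriptors (5) than samples (4).
X = [[1.0, 1.0, 1.0, 0.5, 0.2],
     [-1.0, 1.0, -1.0, -0.5, 0.1],
     [1.0, -1.0, -1.0, 0.4, -0.3],
     [-1.0, -1.0, 1.0, -0.4, 0.0]]
y = [2.05, -2.05, 1.95, -1.95]
print(weighted_lasso(X, y, lam=0.1))
print(weighted_lasso(X, y, lam=0.1, weights=corr_weights(X, y)))
```

The soft-threshold step is what sets coefficients exactly to zero, so descriptors drop out of the model as part of the fit rather than in a separate selection pass.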
Abstract In recent years, Computer-Aided Molecular Design (CAMD) has been extensively used for defining and designing reactions at their maximal potential. In all these contributions, either the reactants/products or the solvents were considered constant. Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on reaction rate is essential. Since the structures of reactants and products are related, such QSPR models will serve as a prerequisite for the simultaneous CAMD of reactants, products and solvents. They will also provide a useful tool for predicting the rate constant without relying on experiments. To develop such a QSPR, in our work, the Diels-Alder reaction with different sets of reactants and solvents was investigated. Connectivity indices were used to represent the structures of the members of each set. Principal Component Analysis (PCA) was applied to identify the principal components (PCs) of each set for further use in model development. These PCs were then used to develop a linear model that best predicts the reaction rates in our study. In this paper, the Genetic Algorithm (GA) has been modified using the Decision Tree (DT) algorithm for increased efficiency. Inclusion of the DT in the GA ensures an initial generation of meaningful combinations of descriptors. This set is further improved in every step of crossover and mutation with applied constraints. Under these constraints, only improvements between generations are accepted. Finally, Multiple Linear Regression (MLR) relates the chosen descriptors to the property under study. The model undergoes thorough internal and external validation to ensure that a best-fit model can be found in the minimum number of steps possible.
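The selection loop described in this abstract can be sketched roughly as follows: a genetic algorithm searches over descriptor subsets, each subset is scored by the R2 of a multiple linear regression fit, and a child replaces its parent only when the score improves. This is a simplified illustration under several assumptions that are not from the paper: the initial population is random rather than seeded by a decision tree, the only constraint is a cap on the number of selected descriptors, fitness is training R2, and the data are synthetic. All function names are hypothetical.

```python
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting; returns None if singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            return None
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def r2_of_subset(X, y, mask, max_features=2):
    """R2 of an MLR fit (with intercept) on the descriptors chosen by mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols or len(cols) > max_features:
        return float("-inf")  # violates the (assumed) size constraint
    D = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    A = [[sum(d[a] * d[b] for d in D) for b in range(k)] for a in range(k)]
    rhs = [sum(d[a] * yi for d, yi in zip(D, y)) for a in range(k)]
    coef = solve(A, rhs)  # normal equations of least squares
    if coef is None:
        return float("-inf")
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(c * v for c, v in zip(coef, d))) ** 2
                 for d, yi in zip(D, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def ga_select(X, y, pop_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    # Random start; the paper seeds this population with a decision tree.
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    fit = [r2_of_subset(X, y, m) for m in pop]
    history = [max(fit)]
    for _ in range(generations):
        for i in range(pop_size):
            mate = pop[rng.randrange(pop_size)]
            cut = rng.randrange(1, p)          # one-point crossover
            child = pop[i][:cut] + mate[cut:]
            flip = rng.randrange(p)            # single-bit mutation
            child[flip] = 1 - child[flip]
            f = r2_of_subset(X, y, child)
            if f > fit[i]:                     # accept improvements only
                pop[i], fit[i] = child, f
        history.append(max(fit))
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best], history

def toy_data(n=8, p=5, seed=1):
    """Synthetic data in which descriptors 0 and 3 drive the property."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1.0 + 2.0 * row[0] - row[3] + rng.gauss(0.0, 0.05) for row in X]
    return X, y

X, y = toy_data()
mask, score, history = ga_select(X, y)
print(mask, score)
```

Because a child is kept only if it beats its parent, the best score in the population never decreases from one generation to the next. Training R2 is used here only for brevity; a validated metric such as Q2 or an external test score would be a more defensible fitness.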
Abstract This contribution outlines the development of a quantitative structure-property relationship (QSPR) that relates solvent structure to the morphology of ibuprofen crystals grown within that solvent. Morphology can be quantified by aspect ratio, and ibuprofen aspect ratio data was obtained for crystals grown in 16 different organic solvents. A combination of 2D and 3D molecular descriptors is calculated to provide a quantitative representation of the geometry-optimized solvent molecules. Empirical force fields were used to estimate the three-dimensional structure of the solvent molecules, and three different force fields were implemented and their effect on the developed models analyzed. The descriptor data matrix, containing a multitude of descriptor types, is reduced in size using Bayesian Information Criterion (BIC) methods and Principal Component Analysis (PCA), for regression into linear models through Principal Component Regression. The predictive capabilities of these models were also analyzed by means of internal and external validation methods.
Background: Models that predict postoperative complications often ignore important intraoperative events and physiological changes. This study tested the hypothesis that accuracy, discrimination, and precision in predicting postoperative complications would improve when using both preoperative and intraoperative input data compared with preoperative data alone. Methods: This retrospective cohort analysis included 43,943 adults undergoing 52,529 inpatient surgeries at a single institution during a five-year period. Random forest machine learning models in the validated MySurgeryRisk platform made patient-level predictions for seven postoperative complications and mortality occurring during hospital admission using electronic health record data and patient neighborhood characteristics. For each outcome, one model was trained with preoperative data alone and one model was trained with both preoperative and intraoperative data. Models were compared by accuracy, discrimination (expressed as AUROC: area under the receiver operating characteristic curve), precision (expressed as AUPRC: area under the precision-recall curve), and reclassification indices. Results: Machine learning models incorporating both preoperative and intraoperative data had greater accuracy, discrimination, and precision than models using preoperative data alone for
Papers by Shounak Datta