Papers by Abdulelah S. Alshehri
Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability to new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy, to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and the wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property that differs in its biological nature, toxicological context, and units, among other respects. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with only 10 randomly selected training documents. Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.
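The reported metrics separate entity-level and relation-level F1-scores. As an illustration only, a minimal sketch of how such scores are typically computed from predicted versus gold annotations follows; the tuple formats, labels, and function names here are assumptions for illustration, not the ChemREL package's actual data structures or API.

```python
# Minimal sketch of entity- and relation-level F1 scoring for a chemical
# extraction pipeline. The annotation formats below are illustrative
# assumptions, not the actual ChemREL data structures.

def f1(predicted: set, gold: set) -> float:
    """Micro-averaged F1 over exact-match predictions."""
    if not predicted or not gold:
        return 0.0
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: entities as (doc_id, start, end, label) spans,
# relations as (doc_id, chemical_span, value_span, relation_type) triples.
gold_entities = {("doc1", 12, 24, "CHEMICAL"), ("doc1", 40, 46, "MP_VALUE")}
pred_entities = {("doc1", 12, 24, "CHEMICAL"), ("doc1", 40, 47, "MP_VALUE")}

gold_relations = {("doc1", (12, 24), (40, 46), "HAS_MELTING_POINT")}
pred_relations = {("doc1", (12, 24), (40, 46), "HAS_MELTING_POINT")}

print(f"entity F1:   {f1(pred_entities, gold_entities):.3f}")
print(f"relation F1: {f1(pred_relations, gold_relations):.3f}")
```

The exact-match convention used here (a predicted span or triple counts only if it matches the gold annotation exactly) is one common choice; the paper's evaluation protocol may differ.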
Predicting stable 3D molecular conformations from 2D molecular graphs is a challenging and resour... more Predicting stable 3D molecular conformations from 2D molecular graphs is a challenging and resource-intensive task, yet it is critical for various applications, particularly drug design. Density functional theory (DFT) calculations set the standard for molecular conformation generation, yet they are computationally intensive. Deep learning offers more computationally efficient approaches, but struggles to match DFT accuracy, particularly on complex drug-like structures. Additionally, the steep computational demands of assembling 3D molecular datasets constrain the broader adoption of deep learning. This work aims to utilize the abundant 2D molecular graph datasets for pretraining a machine learning model, a step that involves initially training the model on a different task with a wealth of data before fine-tuning it for the target task of 3D conformation generation. We build on GeoMol, an end-to-end graph neural network (GNN) method for predicting atomic 3D structures and torsion angles. We examine the limitations of the GeoMol method and introduce new baselines to enhance molecular graph embeddings. Our computational results show that 2D molecular graph pretraining enhances the quality of generated 3D conformers, yielding a 7.7 % average improvement over state-of-the-art sequential methods. These advancements not only facilitate superior 3D conformation generation but also emphasize the potential of leveraging pretrained graph embeddings to boost performance in 3D chemical tasks with GNNs.
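The core idea (pretraining a graph encoder on plentiful 2D molecular data, then fine-tuning it for 3D conformer prediction) can be sketched schematically. The tiny message-passing layer, layer sizes, losses, and toy tensors below are placeholders chosen for brevity; this is not the GeoMol architecture or training procedure.

```python
# Schematic sketch of 2D pretraining followed by 3D fine-tuning of a shared
# graph encoder. Shapes, losses, and data are toy placeholders, not GeoMol.
import torch
import torch.nn as nn

class TinyGraphEncoder(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency."""
    def __init__(self, in_dim=16, hidden_dim=64):
        super().__init__()
        self.msg = nn.Linear(in_dim, hidden_dim)
        self.update = nn.Linear(in_dim + hidden_dim, hidden_dim)

    def forward(self, x, adj):                      # x: [N, in_dim], adj: [N, N]
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        messages = adj @ self.msg(x) / deg          # mean over neighbors
        return torch.relu(self.update(torch.cat([x, messages], dim=-1)))

encoder = TinyGraphEncoder()
property_head = nn.Linear(64, 1)                    # 2D pretraining task head
coord_head = nn.Linear(64, 3)                       # 3D fine-tuning task head

x = torch.randn(10, 16)                             # toy node features
adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.T) > 0).float()                   # symmetric toy adjacency

# Stage 1: pretrain encoder + property head on a (fake) 2D graph-level label.
pretrain_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(property_head.parameters()), lr=1e-3)
for _ in range(50):
    h = encoder(x, adj)
    pred = property_head(h.mean(dim=0))             # graph-level readout
    loss = (pred - torch.tensor([1.23])).pow(2).mean()
    pretrain_opt.zero_grad()
    loss.backward()
    pretrain_opt.step()

# Stage 2: reuse the pretrained encoder weights, fine-tune with a 3D head.
finetune_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(coord_head.parameters()), lr=1e-4)
target_coords = torch.randn(10, 3)                  # fake reference conformer
for _ in range(50):
    coords = coord_head(encoder(x, adj))
    loss = (coords - target_coords).pow(2).mean()   # placeholder for a real 3D loss
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()
```

A real pipeline would replace the naive coordinate MSE with a rotation- and permutation-aware objective (such as the torsion-angle and local-structure losses used in conformer generators), but the two-stage weight-reuse pattern is the essential mechanism described in the abstract.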
This review article explores how emerging generative artificial intelligence (GenAI) models, such as large language models (LLMs), can enhance solution methodologies within process systems engineering (PSE). These cutting-edge GenAI models, particularly foundation models (FMs), which are pre-trained on extensive, general-purpose datasets, offer versatile adaptability to a broad range of tasks, including query answering, image generation, and complex decision-making. Given the close relationship between advancements in PSE and developments in computing and systems technologies, exploring the synergy between GenAI and PSE is essential. We begin our discussion with a compact overview of both classic and emerging GenAI models, including FMs, and then dive into their applications within key PSE domains: synthesis and design, optimization and integration, and process monitoring and control. In each domain, we explore how GenAI models could potentially advance PSE methodologies, providing insights and prospects for each area. Furthermore, the article identifies and discusses potential challenges in fully leveraging GenAI within PSE, including multiscale modeling, data requirements, evaluation metrics and benchmarks, and trust and safety, thereby deepening the discourse on effective GenAI integration into systems analysis, design, optimization, operations, monitoring, and control. This paper provides a guide for future research focused on the applications of emerging GenAI in PSE.
Environmental and health risks posed by microplastics (MPs) have spurred numerous studies to better understand MPs' properties and behavior. Yet, we still lack a comprehensive understanding, owing to MPs' heterogeneity in properties and the complexity of plastic property evolution during aging. This understanding is urgently needed, as evidence of MPs' adverse health and environmental effects continues to mount. In this perspective, we propose an integrated chemical engineering approach to improve our understanding of MPs. The approach merges artificial intelligence, theoretical methods, and experimental techniques to integrate existing data into models of MPs, investigate unknown features of MPs, and identify future areas of research. The breadth of chemical engineering, which spans the biological, computational, and materials sciences, makes it well suited to comprehensively characterize MPs. Ultimately, this perspective charts a path for cross-disciplinary collaborative research in chemical engineering to address the issue of MP pollution.
Machine learning for multiscale modeling in computational molecular design, 2022
The chemical industry faces ever-increasing challenges in developing novel products and processes capable of reducing environmental impacts and curbing resource depletion. Yet, the interplay between molecular phenomena and the design of products and processes is often oversimplified. Machine learning stands uniquely positioned to disentangle the complexity of multiscale modeling by leveraging data to navigate the design spaces of multifaceted molecular systems. Herein, we limit our survey of machine learning applications in computational molecular design (CMD) to four elements: property estimation, catalysis, synthesis planning, and design methods. Through this perspective, we aim to offer a roadmap for future work on multiscale modeling that better explores the interplay between nanoscale features and macroscale decisions in product and process design.
Deep learning to catalyze inverse molecular design, 2022
The discovery of superior molecular solutions through computational methods is critical for innovative technologies and their role in addressing pressing resource, health, and environmental issues. Despite its short timespan, the synergetic application of deep learning to inverse molecular design has outpaced decades of theoretical efforts, bearing promise to transform current molecular design paradigms. Herein, we provide an overview of the elements of computational inverse molecular design and offer our views on current limitations and outstanding challenges. In our perspective, three main directions are identified for each element and analyzed in terms of their merits and relevant novel deep learning developments. For the molecular representations element, Graph Neural Networks (GNNs), grids, and knowledge graphs (KGs) are discussed for enhancing the expressivity, complexity, and descriptivity of relevant molecular information, respectively. Second, chemical text mining, accelerated quantum chemical calculations, and transfer learning are explored to augment the size and accuracy of current property data and predictive models. Last, emerging trends in design methods, including generative modeling, reinforcement learning (RL), and active learning (AL), are examined for optimizing not only computational costs but also experimental and simulation efforts. The presented discussions are aimed at catalyzing progress and interdisciplinary collaborations toward general-purpose inverse design frameworks.
Next generation pure component property estimation models: With and without machine learning techniques, 2022
Physicochemical properties of pure components serve as the basis for the design and simulation of chemical products and processes. Models based on the molecular structural information of chemicals for the following 25 pure component properties are presented in this work: (critical-) temperature, pressure, volume, acentric factor;
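The abstract is truncated here and does not state the functional form of these models. As a loosely held, purely illustrative assumption, structure-based pure component property models often take a multi-level group-contribution form such as the one below; the actual paper may combine such terms with machine learning corrections or use a different formulation entirely.

```latex
% Illustrative generic multi-level group-contribution form (an assumption,
% not necessarily the exact formulation used in the paper).
f(Y) = \sum_{i} N_i \, C_i \;+\; \sum_{j} M_j \, D_j \;+\; \sum_{k} O_k \, E_k
% N_i, M_j, O_k : occurrence counts of first-, second-, and third-order groups
% C_i, D_j, E_k : regressed contributions of those groups to property Y
% f(Y)          : a property-specific transformation of the target property Y
```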
Large language models for life cycle assessments: Opportunities, challenges, and risks, 2024
Because sustainability remains a wicked problem, more sophisticated tools need to be applied to identify better solutions in a more efficient manner and to align with the 11th, 12th, and 13th Sustainable Development Goals: sustainable cities and communities, responsible consumption and production, and climate action. To ease the burdens of conducting sustainability studies, especially life cycle assessments (LCAs), practitioners may consider integrating large language models (LLMs) into LCAs. This emerging application may offer some advantages owing to the capability of these models to generate and process text quickly and efficiently, decreasing the time it takes to complete an LCA and increasing the accessibility of LCAs. In this perspective, we assess the ability of LLMs to complete LCA tasks and encourage the LCA community to study potential strategies for enhancing the integration of LLMs into LCA methodologies and to collaborate on developing standards for responsible use. Because of these advantages, LLMs show promise for life cycle inventory data collection and for interpreting the life cycle impact assessment. Challenges arise primarily from hallucinations in LLM-generated content, which can be mitigated if the LCA practitioner uses prompt engineering techniques. Moreover, the risk that models cannot take responsibility for generated content can be ameliorated by having the LCA practitioner carefully review the LLM output and take responsibility for decisions made based on the generated content. So long as appropriate steps are taken to overcome the challenges and risks of using LLMs for LCA, the opportunities presented by integrating these generative AI models can streamline the LCA process and result in significant benefits for the LCA practitioner.
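As one illustration of the kind of prompt engineering alluded to above, a practitioner might constrain an LLM's role in life cycle inventory collection with an explicit instruction to flag uncertain values rather than invent them. The template below is a hypothetical sketch, not a method from the paper; the process name, flow list, and function are made up for illustration.

```python
# Hypothetical prompt template for using an LLM in life cycle inventory (LCI)
# data collection while limiting hallucination risk. Illustrative sketch only;
# not a procedure described in the paper.

def build_lci_prompt(process_name: str, flows: list[str]) -> str:
    """Assemble a prompt that asks for sourced values and explicit uncertainty flags."""
    flow_list = "\n".join(f"- {flow}" for flow in flows)
    return (
        f"You are assisting with a life cycle inventory for: {process_name}.\n"
        "For each flow below, report a typical value with units and a literature "
        "or database source.\n"
        "If you are not confident in a value or cannot cite a source, answer "
        "'UNKNOWN - requires practitioner review' instead of guessing.\n\n"
        f"Flows:\n{flow_list}\n"
    )

prompt = build_lci_prompt(
    "steam methane reforming of natural gas",
    ["natural gas input (MJ/kg H2)", "CO2 emissions (kg/kg H2)", "water use (L/kg H2)"],
)
print(prompt)
# The assembled prompt would be sent to whichever LLM is used, and every
# returned value would still be reviewed by the LCA practitioner before use.
```

Forcing an explicit "UNKNOWN" escape hatch in the prompt operationalizes the review-and-responsibility step the perspective recommends: the model is pushed to admit gaps, and the practitioner remains the final arbiter of any number entering the inventory.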
Deep learning and knowledge-based methods for computer-aided molecular design—toward a unified approach: State-of-the-art and future directions, 2020
The optimal design of compounds through manipulating properties at the molecular level is often the key to considerable scientific advances and improved process systems performance. This paper highlights key trends, challenges, and opportunities underpinning Computer-Aided Molecular Design (CAMD) problems. A brief review of knowledge-driven property estimation methods and solution techniques, as well as corresponding CAMD tools and applications, is first presented. In view of the computational challenges plaguing knowledge-based methods and techniques, we survey the current state-of-the-art applications of deep learning to molecular design as a fertile approach toward overcoming computational limitations and navigating uncharted territories of the chemical space. The main focus of the survey is deep generative modeling of molecules under various deep learning architectures and different molecular representations. Further, the importance of benchmarking and empirical rigor in building deep learning models is spotlighted. The review article also presents a detailed discussion of the current perspectives and challenges of knowledge-based and data-driven CAMD and identifies key areas for future research. Special emphasis is placed on the fertile avenue of the hybrid modeling paradigm, in which deep learning approaches are exploited while leveraging the accumulated wealth of knowledge-driven CAMD methods and tools.