Unit 1

FUNDAMENTALS OF HEALTHCARE ANALYTICS REGULATION 2021


Introduction

While healthcare costs have been rising constantly, the quality of care provided to patients in the United States has not seen comparable improvement. Several recent studies have shown that incorporating current healthcare technologies can reduce mortality rates, healthcare costs, and medical complications at various hospitals. In 2009, the US government enacted the Health Information Technology for
Economic and Clinical Health Act (HITECH) that includes an incentive program (around $27
billion) for the adoption and meaningful use of Electronic Health Records (EHRs).
The recent advances in information technology have led to an increasing ease in the ability to
collect various forms of healthcare data. In this digital world, data becomes an integral part of
healthcare. A recent report on Big Data suggests that the overall potential of healthcare data will
be around $300 billion [12]. Due to the rapid advancements in the data sensing and acquisition
technologies, hospitals and healthcare institutions have started collecting vast amounts of
healthcare data about their patients. Effectively understanding and building knowledge from
healthcare data requires developing advanced analytical techniques that can effectively transform
data into meaningful and actionable information. General computing technologies have started
revolutionizing the manner in which medical care is available to the patients. Data analytics, in
particular, forms a critical component of these computing technologies. The analytical solutions
when applied to healthcare data have an immense potential to transform healthcare delivery from
being reactive to more proactive. The impact of analytics in the healthcare domain is only going
to grow over the next several years. Analyzing health data allows us to uncover patterns hidden in the data. It also helps clinicians build individualized patient profiles and accurately compute the likelihood of an individual patient suffering a medical complication in the near future.
Healthcare data is particularly rich and it is derived from a wide variety of sources such as
sensors, images, text in the form of biomedical literature/clinical notes, and traditional electronic
records. This heterogeneity in the data collection and representation process leads to numerous
challenges in both the processing and analysis of the underlying data. There is a wide diversity in
the techniques that are required to analyze these different forms of data. In addition, the
heterogeneity of the data naturally creates various data integration and data analysis challenges.
In many cases, insights can be obtained from diverse data types that are otherwise not possible from a single data source. It is only recently that the vast potential of such integrated data analysis methods is being realized.
From a researcher and practitioner perspective, a major challenge in healthcare is its
interdisciplinary nature. The field has often seen advances coming from diverse disciplines such as databases, data mining, and information retrieval, as well as from medical researchers and healthcare practitioners. While this interdisciplinary nature adds to the richness of the field, it
also adds to the challenges in making significant advances. Computer scientists are usually not
trained in domain-specific medical concepts, whereas medical practitioners and researchers also
have limited exposure to the mathematical and statistical background required in the data
analytics area. This has added to the difficulty in creating a coherent body of work in this field
even though it is evident that much of the available data can benefit from such advanced analysis
techniques. Such diversity has often led to independent lines of work from
completely different perspectives. Researchers in the field of data analytics are particularly
susceptible to becoming isolated from real domain-specific problems, and may often propose
problem formulations with excellent technique but with no practical use. This book is an attempt
to bring together these diverse communities by carefully and comprehensively discussing the
most relevant contributions from each domain. It is only by bringing together these diverse
communities that the vast potential of data analysis methods can be harnessed.
FIGURE 1.1: The overall organization of the book’s contents.
Another major challenge that exists in the healthcare domain is the “data privacy gap” between
medical researchers and computer scientists. Healthcare data is obviously very sensitive because
it can reveal compromising information about individuals. Several laws in various countries,
such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States,
explicitly forbid the release of medical information about individuals for any purpose, unless
safeguards are used to preserve privacy. Medical researchers have natural access to healthcare
data because their research is often paired with an actual medical practice. Furthermore, various
mechanisms exist in the medical domain to conduct research studies with voluntary participants.
Such data collection is almost always paired with anonymity and confidentiality agreements.
On the other hand, acquiring data is not quite as simple for computer scientists without a proper
collaboration with a medical practitioner. Even then, there are barriers in the acquisition of data.
Clearly, many of these challenges can be avoided if accepted protocols, privacy technologies,
and safeguards are in place. Therefore, this book will also address these issues. Figure 1.1
provides an overview of the organization of the book’s contents. This book is organized into
three parts:
1. Healthcare Data Sources and Basic Analytics: This part discusses the details of various
healthcare data sources and the basic analytical methods that are widely used in the
processing and analysis of such data. The various forms of patient data that are currently being collected in both clinical and non-clinical environments will be studied. The clinical data include structured electronic health records and biomedical images. Sensor data has been receiving a lot of attention recently; techniques for mining sensor data and biomedical signal analysis will be presented. Personalized medicine has gained importance due to advancements in genomic data, and genomic data analysis involves several statistical techniques, which will also be elaborated. Patients' in-hospital clinical data also include a large amount of unstructured data in the form of clinical notes. In addition, the domain knowledge that can be extracted by mining the biomedical literature will also be discussed. The fundamental data mining, machine learning, information retrieval, and
natural language processing techniques for processing these data types will be extensively
discussed. Finally, behavioral data captured through social media will also be discussed.
2. Advanced Data Analytics for Healthcare: This part deals with the advanced analytical
methods focused on healthcare. This includes the clinical prediction models, temporal data
mining methods, and visual analytics. Integrating heterogeneous data, such as clinical and genomic data, is essential for improving the predictive power of models; this will also be discussed. Information retrieval techniques that can enhance the quality of biomedical
search will be presented. Data privacy is an extremely important concern in healthcare.
Privacy-preserving data publishing techniques will therefore be presented.
3. Applications and Practical Systems for Healthcare: This part focuses on the practical
applications of data analytics and the systems developed using data analytics for healthcare
and clinical practice. Examples include applications of data analytics to pervasive
healthcare, fraud detection, and drug discovery. In terms of practical systems, we will discuss the details of clinical decision support systems, computer-assisted medical imaging systems, and mobile imaging systems.
These different aspects of healthcare are related to one another. Therefore, the chapters in each of
the aforementioned topics are interconnected. Where necessary, pointers are provided across
different chapters, depending on the underlying relevance. This chapter is organized as follows.
Section 1.2 discusses the main data sources that are commonly used and the basic techniques for
processing them. Section 1.3 discusses advanced techniques in the field of healthcare data
analytics. Section 1.4 discusses a number of applications of healthcare analysis techniques. An
overview of resources in the field of healthcare data analytics is presented in Section 1.5. Section
1.6 presents the conclusions.

Computers and biostatistical analysis

1.2 Healthcare Data Sources and Basic Analytics


In this section, the various data sources and their impact on analytical algorithms will be discussed. The heterogeneity of the sources for medical data mining is rather broad, and this creates the need for a wide variety of techniques drawn from different domains of data analytics.

1.2.1 Electronic Health Records

Electronic health records (EHRs) contain a digitized version of a patient's medical history. They encompass a full range of data relevant to a patient's care, such as demographics, problems,
medications, physician’s observations, vital signs, medical history, laboratory data, radiology
reports, progress notes, and billing data. Many EHRs go beyond a patient’s medical or treatment
history and may contain additional broader perspectives of a patient’s care. An important
property of EHRs is that they provide an effective and efficient way for healthcare providers and
organizations to share data with one another. In this context, EHRs are inherently designed to operate in real time, and they can instantly be accessed and edited by authorized users. This can be very
useful in practical settings. For example, a hospital or specialist may wish to access the medical
records of the primary provider. An electronic health record streamlines the workflow by
allowing direct access to the updated records in real time [30]. It can generate a complete record
of a patient’s clinical encounter, and support other care-related activities such as evidence-based
decision support, quality management, and outcomes reporting. The storage and retrieval of
health-related data is more efficient with EHRs. They help improve the quality and convenience of patient care, increase patient participation in the healthcare process, improve the accuracy of diagnoses and health outcomes, and improve care coordination [29]. Various components of
EHRs along with the advantages, barriers, and challenges of using EHRs are discussed in
Chapter 2.

1.2.2 Biomedical Image Analysis


Medical imaging plays an important role in modern-day healthcare due to its immense
capability in providing high-quality images of anatomical structures in human beings. Effectively
analyzing such images can be useful for clinicians and medical researchers since it can aid
disease monitoring, treatment planning, and prognosis [31]. The most popular imaging
modalities used to acquire a biomedical image are magnetic resonance imaging (MRI), computed
tomography (CT), positron emission tomography (PET), and ultrasound (U/S). Being able to look inside the body and view human organs without harming the patient has tremendous implications for human health. Such capabilities allow physicians to better understand the cause of an illness or other adverse conditions without cutting open the patient.
However, merely viewing such organs with the help of images is just the first step of the process.
The final goal of biomedical image analysis is to be able to generate quantitative information and
make inferences from the images that can provide far more insights into a medical condition.
Such analysis has major societal significance since it is the key to understanding biological
systems and solving health problems. However, it includes many challenges since the images are
varied, complex, and can contain irregular shapes with noisy values. A number of general
categories of research problems that arise in analyzing images are object detection, image
segmentation, image registration, and feature extraction. All these challenges when resolved will
enable the generation of meaningful analytic measurements that can serve as inputs to other areas
of healthcare data analytics. Chapter 3 discusses a broad overview of the main medical imaging
modalities along with a wide range of image analysis approaches.

1.2.3 Sensor Data Analysis

Sensor data [2] is ubiquitous in the medical domain both for real time and for retrospective
analysis. Several medical data collection instruments, such as the electrocardiogram (ECG) and the electroencephalogram (EEG), are essentially sensors that collect signals from various parts of the human body [32]. The data collected by these instruments are sometimes used for retrospective analysis, but more often for real-time analysis. Perhaps the most important use case of real-time
analysis is in the context of intensive care units (ICUs) and real-time remote monitoring of
patients with specific medical conditions. In all these cases, the volume of data to be processed can be rather large. For example, in an ICU, it is not uncommon for the monitoring system to receive input from hundreds of data sources, and alarms need to be triggered in real time. Such
applications necessitate the use of big-data frameworks and specialized hardware platforms. In
remote-monitoring applications, both real-time events and long-term analysis of various trends and treatment alternatives are of great interest.
While rapid growth in sensor data offers significant promise to impact healthcare, it also
introduces a data overload challenge. Hence, it becomes extremely important to develop novel
data analytical tools that can process such large volumes of collected data into meaningful and
interpretable knowledge. Such analytical methods will not only allow better observation of patients' physiological signals and provide situational awareness at the bedside, but also provide better insights into the inefficiencies in the healthcare system that may be the root cause of surging costs. The research challenges associated with mining sensor data in healthcare settings, and the sensor mining applications and systems in both clinical and non-clinical settings, are discussed in Chapter 4.

1.2.4 Biomedical Signal Analysis

Biomedical signal analysis consists of measuring signals from biological sources whose origin lies in various physiological processes. Examples of such signals include the
electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG),
electroencephalogram (EEG), electrogastrogram (EGG), phonocardiogram (PCG), and so on. The
analysis of these signals is vital in diagnosing the pathological conditions and in deciding an
appropriate care pathway. The measurement of physiological signals gives some form of
quantitative or relative assessment of the state of the human body. These signals are acquired
from various kinds of sensors and transducers either invasively or non-invasively.
These signals can be either discrete or continuous depending on the kind of care or severity of a
particular pathological condition. The processing and interpretation of physiological signals is
challenging due to the low signal-to-noise ratio (SNR) and the interdependency of the
physiological systems. The signal data obtained from the corresponding medical instruments can
be copiously noisy, and may sometimes require a significant amount of preprocessing. Several
signal processing algorithms have been developed that have significantly enhanced the
understanding of the physiological processes. A wide variety of methods are used for filtering,
noise removal, and compact representation [36]. More sophisticated analysis methods including
dimensionality reduction techniques such as Principal Component Analysis (PCA), Singular
Value Decomposition (SVD), and wavelet transformation have also been widely investigated in
the literature. A broader overview of many of these techniques may also be found in [1, 2]. Time-
series analysis methods are discussed in [37, 40]. Chapter 5 presents an overview of various
signal processing techniques used for processing biomedical signals.

1.2.5 Genomic Data Analysis

A significant number of diseases are genetic in nature, but the causal relationship between genetic markers and diseases has not been fully established. For example, diabetes is well known to be a genetic disease; however, the full set of genetic markers that make an individual prone to diabetes is unknown. In some other cases, such as the blindness caused by Stargardt
disease, the relevant genes are known but all the possible mutations have not been exhaustively
isolated. Clearly, a broader understanding of the relationships between various genetic markers,
mutations, and disease conditions has significant potential in assisting the development of
various gene therapies to cure these conditions. One is mostly interested in understanding what kinds of health-related questions can be addressed through in-silico, data-driven analysis of genomic data. Moreover, translating genetic discoveries into
personalized medicine practice is a highly non-trivial task with a lot of unresolved challenges.
For example, the genomic landscapes in complex diseases such as cancers are overwhelmingly
complicated, revealing a high order of heterogeneity among different individuals. Solving these
issues will fit a major piece of the puzzle and bring the concept of personalized medicine much closer to reality.
Recent advancements in biotechnology have led to the rapid generation of large volumes of biological and medical information and have advanced genomic research. This has also led
to unprecedented opportunities and hopes for genome scale study of challenging problems in life
science. For example, advances in genomic technology made it possible to study the complete
genomic landscape of healthy individuals for complex diseases [16]. Many of these research
directions have already shown promising results in terms of generating new insights into the biology of human disease and predicting the personalized response of an individual to a particular treatment. Also, genetic data are often modeled either as sequences or as networks.
Therefore, the work in this field requires a good understanding of sequence and network mining
techniques. Various data analytics-based solutions are being developed for tackling key research
problems in medicine such as identification of disease biomarkers and therapeutic targets and
prediction of clinical outcome. More details about the fundamental computational algorithms and
bioinformatics tools for genomic data analysis along with genomic data resources are discussed
in Chapter 6.

1.2.6 Clinical Text Mining

Most of the information about patients is encoded in the form of clinical notes. These notes are
typically stored in an unstructured data format and are the backbone of much of healthcare data. They contain clinical information from the transcription of dictations, direct entry by providers, or the use of speech recognition applications, and they are perhaps the richest source of unexploited information. Needless to say, manually encoding the broad range of clinical information in this free-text form is too costly and time-consuming, so in practice manual coding is limited to primary and secondary diagnoses and procedures for billing purposes. Such notes are
notoriously challenging to analyze automatically due to the complexity involved in converting
clinical text that is available in free-text to a structured format. It becomes hard mainly because
of their unstructured nature, heterogeneity, diverse formats, and varying context across different
patients and practitioners.
Natural language processing (NLP) and entity extraction play an important part in inferring
useful knowledge from large volumes of clinical text and in automatically encoding clinical information in a timely manner [22]. In general, data preprocessing methods are more important in these contexts than the actual mining techniques. The processing of clinical text using
NLP methods is more challenging when compared to the processing of other texts due to the
ungrammatical nature of short and telegraphic phrases, dictations, shorthand lexicons such as
abbreviations and acronyms, and often misspelled clinical terms. All these problems will have a
direct impact on the various standard NLP tasks such as shallow or full parsing, sentence
segmentation, text categorization, etc., thus making the clinical text processing highly
challenging. A wide range of NLP methods and data mining techniques for extracting
information from the clinical text are discussed in Chapter 7.

1.2.7 Mining Biomedical Literature


A significant number of applications rely on evidence from the biomedical literature. The latter is
copious and has grown significantly over time. The use of text mining methods for the long-term
preservation, accessibility, and usability of digitally available resources is important in
biomedical applications relying on evidence from scientific literature. Text mining methods and
tools offer novel ways of applying new knowledge discovery methods in the biomedical field
[21][20]. Such tools offer efficient ways to search, extract, combine, analyze and summarize
textual data, thus supporting researchers in knowledge discovery and generation. One of the
major challenges in biomedical text mining is the multidisciplinary nature of the field. For
example, biologists describe chemical compounds using brand names, while chemists often use
less ambiguous IUPAC-compliant names or unambiguous descriptors such as International
Chemical Identifiers. While the latter can be handled with cheminformatics tools, text mining
techniques are required to extract less precisely defined entities and their relations from the
literature. In this context, entity and event extraction methods play a key role in discovering
useful knowledge from unstructured databases. Because the cost of curating such databases is too
high, text mining methods offer new opportunities for their effective population, update, and
integration. Text mining brings about other benefits to biomedical research by linking textual
evidence to biomedical pathways, reducing the cost of expert knowledge validation, and
generating hypotheses. The approach provides a general methodology to discover previously
unknown links and enhance the way in which biomedical knowledge is organized. More details
about the challenges and algorithms for biomedical text mining are discussed in Chapter 8.

1.2.8 Social Media Analysis

The rapid emergence of various social media resources such as social networking sites,
blogs/microblogs, forums, question answering services, and online communities provides a
wealth of information about public opinion on various aspects of healthcare. Social media data
can be mined for patterns and knowledge that can be leveraged to make useful inferences about
population health and public health monitoring. A significant amount of public health
information can be gleaned from the inputs of various participants at social media sites. Although
most individual social media posts and messages contain little informational value, aggregation
of millions of such messages can generate important knowledge [4, 19]. Effectively analyzing
these vast pieces of knowledge can significantly reduce the latency in collecting such complex
information.
Previous research on social media analytics for healthcare has focused on capturing aggregate
health trends such as outbreaks of infectious diseases, detecting reports of adverse drug
interactions, and improving interventional capabilities for health-related activities. Disease outbreaks are often strongly reflected in the content of social media, and an analysis of the history of this content provides valuable insights about them. Topic models are
frequently used for high-level analysis of such health-related content. An additional source of
information in social media sites is obtained from online doctor and patient communities. Since
medical conditions recur across different individuals, the online communities provide a valuable
source of knowledge about various medical conditions. A major challenge in social media
analysis is that the data is often unreliable, and therefore the results must be interpreted with
caution. More discussion about the impact of social media analytics in improving healthcare is
given in Chapter 9.

1.3 Advanced Data Analytics for Healthcare


This section will discuss a number of advanced data analytics methods for healthcare. These
techniques include various data mining and machine learning models that need to be adapted to
the healthcare domain.

1.3.1 Clinical Prediction Models

Clinical prediction forms a critical component of modern-day healthcare. Several prediction models have been extensively investigated and have been successfully deployed in clinical
practice [26]. Such models have made a tremendous impact in terms of diagnosis and treatment
of diseases. Most successful supervised learning methods that have been employed for clinical
prediction tasks fall into three categories: (i) Statistical methods such as linear regression,
logistic regression, and Bayesian models; (ii) Sophisticated methods in machine learning and
data mining such as decision trees and artificial neural networks; and (iii) Survival models that
aim to predict survival outcomes. All of these techniques focus on discovering the underlying relationship between covariate variables, which are also known as attributes and features, and a dependent outcome variable.
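As an illustrative sketch of the first category, a logistic regression model for a binary clinical outcome can be fit with scikit-learn; the covariates and labels below are synthetic, not from the text:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic covariates (age, systolic blood pressure) and a binary outcome label
X = np.array([[45, 120], [62, 150], [50, 130], [70, 160], [38, 110], [55, 145]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of the outcome for a new patient
print(model.predict_proba([[60, 140]])[0, 1])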
The choice of the model to be used for a particular healthcare problem primarily depends on the
outcomes to be predicted. There are various kinds of prediction models that are proposed in the
literature for handling such a diverse variety of outcomes. Some of the most common outcomes
include binary and continuous forms. Other less common forms are categorical and ordinal
outcomes. In addition, there are also different models proposed to handle survival outcomes
where the goal is to predict the time of occurrence of a particular event of interest. These survival
models are also widely studied in the context of clinical data analysis in terms of predicting the
patient's survival time. There are different ways of evaluating and validating the performance of these prediction models. Different prediction models along with various kinds of evaluation
mechanisms in the context of healthcare data analytics will be discussed in Chapter 10.

1.3.2 Temporal Data Mining

Healthcare data almost always contain time information, and it is inconceivable to reason over and mine these data without incorporating the temporal dimension. There are two major sources of
temporal data generated in the healthcare domain. The first is the electronic health records (EHR)
data and the second is the sensor data. Mining the temporal dimension of EHR data is extremely
promising as it may reveal patterns that enable a more precise understanding of disease
manifestation, progression and response to therapy. Some of the unique characteristics of EHR
data (such as heterogeneity, sparsity, high dimensionality, and irregular time intervals) make conventional methods inadequate to handle them. Unlike EHR data, sensor data are usually
represented as numeric time series that are regularly measured in time at a high frequency.
Examples of these data are physiological data obtained by monitoring the patients on a regular
basis and other electrical activity recordings such as electrocardiogram (ECG),
electroencephalogram (EEG), etc. Sensor data for a specific subject are measured over a much
shorter period of time (usually several minutes to several days) compared to the longitudinal
EHR data (usually collected across the entire lifespan of the patient).
Given the different natures of EHR data and sensor data, the choice of appropriate temporal data
mining methods for these types of data are often different. EHR data are usually mined using
temporal pattern mining methods, which represent data instances (e.g., patients’ records) as
sequences of discrete events (e.g., diagnosis codes, procedures, etc.) and then try to find and
enumerate statistically relevant patterns that are embedded in the data. On the other hand, sensor
data are often analyzed using signal processing and time-series analysis techniques (e.g., wavelet
transform, independent component analysis, etc.) [37, 40]. Chapter 11 presents a detailed survey
and summarizes the literature on temporal data mining for healthcare data.

1.3.3 Visual Analytics

The ability to analyze and identify meaningful patterns in multimodal clinical data must be
addressed in order to provide a better understanding of diseases and to identify patterns that
could be affecting the clinical workflow. Visual analytics provides a way to combine the
strengths of human cognition with interactive interfaces and data analytics that can facilitate the
exploration of complex datasets. Visual analytics is a science that involves the integration of
interactive visual interfaces with analytical techniques to develop systems that facilitate
reasoning over, and interpretation of, complex data [23]. Visual analytics is popular in many
aspects of healthcare data analysis because of the wide variety of insights that such an analysis
provides. Due to the rapid increase of health-related information, it becomes critical to build
effective ways of analyzing large amounts of data by leveraging human–computer interaction
and graphical interfaces. In general, providing easily understandable summaries of complex
healthcare data is useful for a human in gaining novel insights.
In the evaluation of many diseases, clinicians are presented with datasets that often contain
hundreds of clinical variables. The multimodal, noisy, heterogeneous, and temporal
characteristics of the clinical data pose significant challenges to the users while synthesizing the
information and obtaining insights from the data [24]. The amount of information being
produced by healthcare organizations opens up opportunities to design new interactive interfaces
to explore large-scale databases, to validate clinical data and coding techniques, and to increase
transparency within different departments, hospitals, and organizations. While many of the visual
methods can be directly adopted from the data mining literature [11], a number of methods,
which are specific to the healthcare domain, have also been designed. A detailed discussion on
the popular data visualization techniques used in clinical settings and the areas in healthcare that
benefit from visual analytics are discussed in Chapter 12.

1.3.4 Clinico–Genomic Data Integration


Human diseases are inherently complex in nature and are usually governed by a complicated
interplay of several diverse underlying factors, including different genomic, clinical, behavioral,
and environmental factors. Clinico–pathological and genomic datasets capture the different
effects of these diverse factors in a complementary manner. It is essential to build integrative
models considering both genomic and clinical variables simultaneously so that they can combine
the vital information that is present in both clinical and genomic data [27]. Such models can help
in the design of effective diagnostics, new therapeutics, and novel drugs, which will lead us one
step closer to personalized medicine [17].
This opportunity has led to an emerging area of integrative predictive models that can be built by
combining clinical and genomic data, which is called clinico–genomic data integration. Clinical
data refers to a broad category of a patient’s pathological, behavioral, demographic, familial,
environmental and medication history, while genomic data refers to a patient’s genomic
information including SNPs, gene expression, protein and metabolite profiles. In most of the
cases, the goal of the integrative study is biomarker discovery, which aims to find the clinical and genomic factors related to a particular disease phenotype (such as cancer vs. no cancer, or tumor vs. normal tissue samples) or to a continuous variable such as the survival time after a particular
treatment. Chapter 13 provides a comprehensive survey of different challenges with clinico–
genomic data integration along with the different approaches that aim to address these challenges
with an emphasis on biomarker discovery.

1.3.5 Information Retrieval

Although most work in healthcare data analytics focuses on mining and analyzing patient-related data, additional information for use in this process includes scientific data and literature. The
techniques most commonly used to access this data include those from the field of information
retrieval (IR). IR is the field concerned with the acquisition, organization, and searching of
knowledge-based information, which is usually defined as information derived and organized
from observational or experimental research [14]. The use of IR systems has become essentially
ubiquitous. It is estimated that among individuals who use the Internet in the United States, over
80 percent have used it to search for personal health information and virtually all physicians use
the Internet.
Information retrieval models are closely related to the problems of clinical and biomedical text
mining. The basic objective of information retrieval is to find the content that a user wants based on his or her requirements. This typically begins with the posing of a query to the IR
system. A search engine matches the query to content items through metadata. The two key
components of IR are: Indexing, which is the process of assigning metadata to the content, and
retrieval, which is the process of the user entering the query and retrieving relevant content. The
most well-known data structure used for efficient information retrieval is the inverted index
where each document is associated with an identifier. Each word then points to a list of
document identifiers. This kind of representation is particularly useful for a keyword search; a minimal sketch appears at the end of this section.
Furthermore, once a search has been conducted, mechanisms are required to rank the possibly
large number of results, which might have been retrieved. A number of user-oriented evaluations
have been performed over the years looking at users of biomedical information and measuring
the search performance in clinical settings [15]. Chapter 14 discusses a number of information
retrieval models for healthcare along with evaluation of such retrieval models.
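As a minimal sketch of the inverted index described above (the documents and identifiers are hypothetical):

from collections import defaultdict

# Each document is associated with an identifier
documents = {
    1: "diabetes insulin therapy",
    2: "insulin resistance study",
    3: "therapy outcomes in diabetes",
}

# Each word points to a list of document identifiers
index = defaultdict(list)
for doc_id, text in documents.items():
    for word in set(text.split()):
        index[word].append(doc_id)

print(index["insulin"])   # [1, 2]
print(index["diabetes"])  # [1, 3]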

1.3.6 Privacy-Preserving Data Publishing

In the healthcare domain, the definition of privacy is commonly accepted as “a person’s right
and desire to control the disclosure of their personal health information” [25]. Patients’ health-
related data is highly sensitive because of the potentially compromising information about
individual participants. Various forms of data such as disease information or genomic
information may be sensitive for different reasons. To enable research in the field of medicine, it
is often important for medical organizations to be able to share their data with statistical experts.
Sharing personal health information can bring enormous economical benefits. This naturally
leads to concerns about the privacy of individuals being compromised. The data privacy problem
is one of the most important challenges in the field of healthcare data analytics. Most privacy-preservation methods reduce the representation accuracy of the data so that identifying an individual's sensitive attributes becomes harder. This can be achieved by perturbing the sensitive attribute, perturbing the attributes that serve as identification mechanisms, or a combination of the two. Clearly, this process requires a reduction in the accuracy of the data representation, so privacy preservation almost always incurs the cost of losing some data utility. The goal of privacy-preservation methods is therefore to optimize the trade-off between utility and privacy, ensuring that the utility loss at a given level of privacy is as small as possible.
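As a minimal sketch of one such perturbation, an identifying attribute can be generalized into coarser ranges at the cost of representation accuracy; the records below are hypothetical:

# Generalize an identifying attribute (age) into coarse ranges,
# trading representation accuracy for privacy
def generalize_age(age, bucket=10):
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

records = [{"age": 34, "diagnosis": "asthma"}, {"age": 67, "diagnosis": "diabetes"}]
for record in records:
    record["age"] = generalize_age(record["age"])
print(records)  # ages become '30-39' and '60-69'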
The major steps in privacy-preserving data publication algorithms [5][18] are the identification
of an appropriate privacy metric and level for a given access setting and data characteristics,
application of one or multiple privacy-preserving algorithm(s) to achieve the desired privacy
level, and post-analysis of the utility of the processed data. These three steps are repeated until the
desired utility and privacy levels are jointly met. Chapter 15 focuses on applying privacy-
preserving algorithms to healthcare data for secondary-use data publishing and interpretation of
the usefulness and implications of the processed data.

1.4 Applications and Practical Systems for Healthcare


In the final set of chapters in this book, we will discuss the practical healthcare applications and
systems that heavily utilize data analytics. These topics have evolved significantly in the past few
years and are continuing to gain a lot of momentum and interest. Some of these methods, such as
fraud detection, are not directly related to medical diagnosis, but are nevertheless important in
this domain.

1.4.1 Data Analytics for Pervasive Health

Pervasive health refers to the process of tracking medical well-being and providing long-term
medical care with the use of advanced technologies such as wearable sensors. For example,
wearable monitors are often used for measuring the long-term effectiveness of various treatment
mechanisms. These methods, however, face a number of challenges, such as knowledge
extraction from the large volumes of data collected and real-time processing. Recent advances in both hardware and software technologies (data analytics in particular) have made such systems a reality, enabling low-cost intelligent health systems embedded within home and living environments [33].
A wide variety of sensor modalities can be used when developing intelligent health systems,
including wearable and ambient sensors [28]. In the case of wearable sensors, sensors are
attached to the body or woven into garments. For example, 3-axis accelerometers distributed
over an individual’s body can provide information about the orientation and movement of the
corresponding body part. In addition to these advancements in sensing modalities, there has been
an increasing interest in applying analytics techniques to data collected from such equipment.
Several practical healthcare systems have started using analytical solutions. Some examples
include cognitive health monitoring systems based on activity recognition, persuasive systems
for motivating users to change their health and wellness habits, and abnormal health condition
detection systems.

1.4.2 Healthcare Fraud Detection

Healthcare fraud has been one of the biggest problems faced by the United States and costs
several billions of dollars every year. With growing healthcare costs, the threat of healthcare
fraud is increasing at an alarming pace. Given the recent scrutiny of the inefficiencies in the US
healthcare system, identifying fraud has been at the forefront of efforts to reduce healthcare costs. One could analyze healthcare claims data along different dimensions to
identify fraud. The complexity of the healthcare domain, which includes multiple sets of
participants, including healthcare providers, beneficiaries (patients), and insurance companies,
makes the problem of detecting healthcare fraud equally challenging and makes it different from
other domains such as credit card fraud detection and auto insurance fraud detection. In these
other domains, the methods rely on constructing profiles for the users based on the historical data
and they typically monitor deviations in the behavior of the user from the profile [7]. However,
in healthcare fraud, such approaches are not usually applicable, because the users in the
healthcare setting are the beneficiaries, who typically are not the fraud perpetrators. Hence, more
sophisticated analysis is required in the healthcare sector to identify fraud.
Several solutions based on data analytics have been investigated for solving the problem of
healthcare fraud. The primary advantages of data-driven fraud detection are automatic extraction
of fraud patterns and prioritization of suspicious cases [3]. Most such analysis is performed with respect to an episode of care, which is essentially the collection of healthcare services provided to a patient for the same health issue.

1.4.3 Data Analytics for Pharmaceutical Discoveries

The cost of successful novel chemistry-based drug development often reaches millions of
dollars, and the time to introduce the drug to market often comes close to a decade [34]. The high
failure rate of drugs during this process makes the trial phases known as the “valley of death.”
Most new compounds fail during the FDA approval process in clinical trials or cause adverse
side effects. Interdisciplinary computational approaches that combine statistics, computer
science, medicine, chemoinformatics, and biology are becoming highly valuable for drug
discovery and development. In the context of pharmaceutical discoveries, data analytics can
potentially limit the search space and provide recommendations to the domain experts for
hypothesis generation and further analysis and experiments.
Data analytics can be used in several stages of drug discovery and development to achieve
different goals. In this domain, one way to categorize data analytical approaches is based on their
application to pre-marketing and post-marketing stages of the drug discovery and development
process. In the pre-marketing stage, data analytics focuses on discovery activities such as finding signals that indicate relations between drugs and targets, drugs and other drugs, genes and diseases, and proteins and diseases, and on finding biomarkers. In the post-marketing stage, an important
application of data analytics is to find indications of adverse side effects for approved drugs.
These methods provide a list of potential drug side effect associations that can be used for further
studies. Chapter 18 provides more discussion of the applications of data analytics for
pharmaceutical discoveries including drug-target interaction prediction and pharmacovigilance.

1.4.4 Clinical Decision Support Systems

Clinical Decision Support Systems (CDSS) are computer systems designed to assist clinicians
with patient-related decision making, such as diagnosis and treatment [6]. CDSS have become a
crucial component in the evaluation and improvement of patient treatment since they have been shown to improve both patient outcomes and cost of care [35]. They can help in minimizing analytical
errors by notifying the physician of potentially harmful drug interactions, and their diagnostic
procedures have been shown to enable more accurate diagnoses. Some of the main advantages of
CDSS are their ability in decision making and determining optimal treatment strategies, aiding
general health policies by estimating the clinical and economic outcomes of different treatment
methods, and even estimating treatment outcomes under certain conditions. The main reasons for the success of CDSS are their electronic nature, their seamless integration with clinical workflows, and their ability to provide decision support at the appropriate time and location. Two particular fields of healthcare
where CDSS have been extremely influential are pharmacy and billing. CDSS can help
pharmacies to look for negative drug interactions and then report them to the corresponding
patient’s ordering professional. In the billing departments, CDSS have been used to devise
treatment plans that provide an optimal balance of patient care and financial expense.

1.4.5 Computer-Aided Diagnosis

Computer-aided diagnosis/detection (CAD) is a procedure in radiology that supports radiologists in reading medical images [13]. CAD tools in general refer to fully automated second reader
tools designed to assist the radiologist in the detection of lesions. There is a growing consensus
among clinical experts that the use of CAD tools can improve the performance of the radiologist.
The radiologist first performs an interpretation of the images as usual, while the CAD algorithm runs in the background or has already been precomputed. Structures identified by the CAD algorithm are then highlighted as regions of interest to the radiologist. The principal value of CAD tools is determined not by their stand-alone performance, but rather by carefully measuring
the incremental value of CAD in normal clinical practice, such as the number of additional
lesions detected using CAD. In addition, CAD systems must not have a negative impact on patient management (for instance, false positives that cause the radiologist to recommend unnecessary biopsies and follow-ups).
From the data analytics perspective, new CAD algorithms aim at extracting key quantitative
features, summarizing vast volumes of data, and/or enhancing the visualization of potentially
malignant nodules, tumors, or lesions in medical images. The three important stages in the CAD
data processing are candidate generation (identifying suspicious regions of interest), feature
extraction (computing descriptive morphological or texture features), and classification
(differentiating candidates that are true lesions from the rest of the candidates based on candidate
feature vectors).
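As an illustrative sketch of the classification stage of this pipeline (the candidate feature vectors and labels are synthetic):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stage 2 output: one descriptive feature vector per candidate region
# (e.g., size, mean intensity, texture score), with ground-truth labels
candidate_features = np.array([[4.1, 0.8, 0.3], [1.2, 0.2, 0.9],
                               [3.8, 0.7, 0.4], [0.9, 0.1, 0.8]])
is_true_lesion = np.array([1, 0, 1, 0])

# Stage 3: differentiate true lesions from the remaining candidates
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(candidate_features, is_true_lesion)
print(clf.predict([[3.9, 0.75, 0.35]]))  # classified as a true lesion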

1.4.6 Mobile Imaging for Biomedical Applications

Mobile imaging refers to the application of portable computers such as smartphones or tablet
computers to store, visualize, and process images with and without connections to servers, the
Internet, or the cloud. Today, portable devices provide sufficient computational power for
biomedical image processing, and smart devices have been introduced in the operating theater. While many techniques for biomedical image acquisition will always require special equipment, the regular camera is one of the most widely used imaging modalities in hospitals. Mobile technology and smart devices, especially smartphones, allow new and easier ways of imaging at the patient's bedside and have the potential to become diagnostic tools that can be used by medical professionals. Smartphones usually contain at least one high-resolution camera that can
be used for image formation. Several challenges arise during the acquisition, visualization,
analysis, and management of images in mobile environments.

1.5 Resources for Healthcare Data Analytics


There are several resources available in this field. We will now discuss the various books,
journals, and organizations that provide further information on this exciting area of healthcare
informatics.
Because medical data typically contain highly sensitive patient information, research work in healthcare data analytics has been fragmented across various places. Many researchers work with a specific hospital or healthcare facility that is usually not willing to share its data due to obvious privacy concerns. However, there are a wide variety of
public repositories available for researchers to design and apply their own models and
algorithms. Due to the diversity in healthcare research, it would be a cumbersome task to compile all the healthcare repositories in a single location. Specific health data repositories dealing with a
particular healthcare problem and data sources are listed in the corresponding chapters where the
data is discussed. We hope that these repositories will be useful for both existing and upcoming
researchers who do not have access to the health data from hospitals and healthcare facilities.

What Is Probability?

Probability denotes the possibility of something happening. It is a mathematical concept that quantifies how likely events are to occur, with values expressed between 0 and 1. Probability is defined as the degree to which something is likely to occur, and this fundamental notion also underlies probability distributions.

Probability is the branch of mathematics that deals with the study of random events and their
likelihood of occurrence. It provides a way to quantify and reason about uncertainty, making it
an essential tool for decision-making in fields such as statistics, finance, engineering, and
science. The foundation of probability theory is based on the concept of sample spaces, events,
and probabilities, which are used to model and analyze a wide range of real-world phenomena.
Understanding probability is essential for many applications in our daily lives, including
predicting the weather, assessing risk, and analyzing data. There are several types of probability:

Classical probability: This is based on the concept of equally likely outcomes. For example, if you roll a fair die, the probability of rolling a 6 is 1/6, because there are 6 equally likely outcomes (rolling a 1, 2, 3, 4, 5, or 6); see the simulation sketch after this list.

Empirical probability: This is based on the relative frequency of an event occurring in a large
number of trials. For example, if you flip a coin 100 times and it lands heads 60 times, the
empirical probability of the coin landing heads is 60/100, or 3/5.

Subjective probability: This is based on an individual's personal belief or judgment about the
likelihood of an event occurring. It is subjective because it can vary from person to person and is
not necessarily based on any objective measure.

Axiomatic probability: This is a mathematical approach to probability that is based on axioms (or
assumptions) about the probability of events occurring. It allows for the calculation of
probabilities using mathematical techniques and is widely used in the field of statistics.
Conditional probability: This is the probability of an event occurring given that another event has already occurred. For example, when rolling a fair die twice, the probability of rolling a 6 on the second roll given that the first roll was a 4 is still 1/6, because the rolls are independent and the second roll still has 6 equally likely outcomes (rolling a 1, 2, 3, 4, 5, or 6).
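As a brief illustration of the classical and empirical notions above (a sketch, not from the original text), the following Python snippet estimates the probability of rolling a 6 by simulation:

import random

rolls = 100_000
sixes = sum(1 for _ in range(rolls) if random.randint(1, 6) == 6)

print("Empirical probability of a 6:", sixes / rolls)  # approaches 1/6
print("Classical probability of a 6:", 1 / 6)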

Likelihood and odds

In healthcare analytics, understanding likelihood and odds is essential for making informed
decisions, predicting outcomes, and assessing risks. Let's delve into the concepts of likelihood
and odds, their applications in healthcare analytics, and their significance in improving patient
care and healthcare management.

1. Likelihood in Healthcare Analytics

Likelihood refers to the probability of an event occurring. In healthcare analytics, likelihood plays a crucial role in various aspects:

Diagnostic Accuracy: Likelihood ratios (LR) are important metrics used to assess the accuracy of diagnostic tests. The likelihood ratio compares the probability of a particular test result in patients with the target disorder to the probability of the same result in patients without the disorder (see the sketch after this list).

Prediction Models: Likelihood-based models are commonly used in healthcare analytics to predict the likelihood of future events or outcomes. For example, predictive analytics models can forecast the likelihood of disease outbreaks, patient deterioration, or readmission.

Risk Assessment: Likelihood assessment helps in identifying high-risk patients or populations. By analyzing patient data, healthcare providers can determine the likelihood of adverse events such as complications, readmissions, or mortality.
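As a minimal sketch of the likelihood ratios mentioned above (the sensitivity and specificity values are hypothetical):

def likelihood_ratios(sensitivity, specificity):
    # LR+ = P(positive test | disease) / P(positive test | no disease)
    lr_positive = sensitivity / (1 - specificity)
    # LR- = P(negative test | disease) / P(negative test | no disease)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

# Hypothetical test with 90% sensitivity and 80% specificity
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(lr_pos, lr_neg)  # 4.5 and 0.125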

2. Odds in Healthcare Analytics


Odds represent the ratio of the probability of an event occurring to the probability of it not
occurring. In healthcare analytics, odds are used for various purposes:

Odds Ratios: Odds ratios (OR) are frequently used in medical practice to compare the odds of an outcome between different groups. For example, researchers may calculate the odds ratio to compare the odds of developing a particular disease between patients with and without a certain risk factor (see the sketch after this list).

Comparative Analysis: Odds are useful for conducting comparative analyses in healthcare. By
comparing the odds of different outcomes or interventions, healthcare professionals can assess
their relative effectiveness or risk-benefit profiles.

Risk Communication: Communicating risks to patients or stakeholders often involves presenting information in terms of odds. Understanding the odds allows individuals to make informed decisions about their healthcare choices.
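As a minimal sketch of an odds ratio computation (the 2x2 counts are hypothetical):

def odds_ratio(a, b, c, d):
    # a: exposed with outcome,   b: exposed without outcome
    # c: unexposed with outcome, d: unexposed without outcome
    return (a / b) / (c / d)  # equivalently (a * d) / (b * c)

# Hypothetical study: 30 of 100 exposed patients developed the disease (70 did not),
# versus 10 of 100 unexposed patients (90 did not)
print(odds_ratio(30, 70, 10, 90))  # about 3.86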

3. Significance of Likelihood and Odds in Healthcare Analytics

Likelihood and odds play significant roles in healthcare analytics for several reasons:

Clinical Decision Making: Healthcare providers rely on likelihood and odds to make clinical
decisions, such as diagnostic testing, treatment selection, and risk stratification. Understanding
the likelihood of certain outcomes helps clinicians prioritize interventions and allocate resources
effectively.

Predictive Modeling: Predictive analytics models use likelihood-based algorithms to forecast future events or outcomes, such as disease progression, patient deterioration, or healthcare utilization. These models enable proactive interventions and resource planning to improve patient outcomes and reduce costs.

Risk Assessment and Management: Likelihood and odds are integral to risk assessment and
management in healthcare. By assessing the likelihood of adverse events or complications,
healthcare organizations can implement preventive measures, monitor high-risk patients closely,
and mitigate potential risks.

4. Challenges and Considerations

Despite their importance, there are challenges and considerations associated with likelihood and
odds in healthcare analytics:

Data Quality: Accurate estimation of likelihood and odds requires high-quality data. Data
inaccuracies, inconsistencies, or biases can affect the reliability of analytics insights and
decision-making.

Interpretation Complexity: Likelihood and odds ratios may be challenging to interpret for non-specialists. Effective communication of these concepts to healthcare professionals, patients, and stakeholders is essential for informed decision-making.

Ethical Considerations: There are ethical considerations surrounding the use of likelihood and
odds in healthcare analytics, particularly regarding privacy, consent, and the potential for
algorithmic bias. Ensuring transparency, fairness, and accountability in analytics practices is
crucial.

What Are Probability Distributions?

A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range is bounded by the minimum and maximum possible values, but where a given value is likely to fall within that range is determined by a number of factors, including the mean (average), standard deviation, skewness, and kurtosis of the distribution.

Types of Probability Distribution


The probability distribution is divided into two parts:

1. Discrete Probability Distributions

2. Continuous Probability Distributions

Discrete Probability Distribution

A discrete distribution describes the probability of occurrence of each value of a discrete random
variable. The number of spoiled apples out of 6 in your refrigerator can be an example of a
discrete probability distribution.

Each possible value of the discrete random variable can be associated with a non-zero probability
in a discrete probability distribution.

Let's discuss some significant probability distribution functions.

Binomial Distribution

The binomial distribution is a discrete distribution with a finite number of possible outcomes. It
arises when observing a series of what are known as Bernoulli trials, where a Bernoulli trial is an
experiment with only two outcomes: success or failure.

Consider a random experiment in which you toss a biased coin six times, with a 0.4 chance of
getting a head on each toss. If 'getting a head' is considered a 'success', the binomial distribution
gives the probability of r successes for each value of r.

The binomial random variable represents the number of successes (r) in n consecutive
independent Bernoulli trials.
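
For the coin example above, a minimal sketch (assuming scipy.stats is available; n = 6 and p = 0.4 are taken from the example) prints the probability of each possible number of heads:

# Binomial distribution: 6 tosses of a biased coin with P(head) = 0.4.
from scipy.stats import binom

n, p = 6, 0.4
for r in range(n + 1):
    # pmf(r, n, p) gives the probability of exactly r successes in n trials
    print(f"P({r} heads) = {binom.pmf(r, n, p):.4f}")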
Bernoulli Distribution

The Bernoulli distribution is a special case of the binomial distribution in which only one trial
is conducted, resulting in a single observation. As a result, the Bernoulli distribution describes
events that have exactly two outcomes.

Here is a short Python sketch of the Bernoulli distribution (using scipy.stats.bernoulli; the
parameter value is an illustrative choice):
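
# Bernoulli distribution: a single trial with success probability p.
# The value p = 0.7 is illustrative, not taken from the original listing.
from scipy.stats import bernoulli

p = 0.7
for outcome in [0, 1]:
    # pmf gives P(X = 0) = 1 - p and P(X = 1) = p
    print(f"P(X = {outcome}) = {bernoulli.pmf(outcome, p):.2f}")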


The Bernoulli random variable's expected value is p, which is also known as the parameter of the
Bernoulli distribution.

The outcome of the experiment is either 0 or 1, so a Bernoulli random variable can only take the
values 0 and 1.

The pmf function is used to calculate the probability of each possible value of the random variable.
Poisson Distribution

A Poisson distribution is a probability distribution used in statistics to show how many times an
event is likely to happen over a given period of time. In other words, it is a count distribution.
Poisson distributions are frequently used to model independent events occurring at a constant
rate over a fixed time interval. The distribution is named after the French mathematician
Siméon Denis Poisson.

The Python code below shows a simple example of the Poisson distribution.

It has two parameters:

1. lam: the known average number of occurrences (the rate of events)

2. size: the shape of the returned array

The code generates a 1x100 array of samples for an occurrence rate of 5.
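
A sketch of that example (numpy.random.poisson is an assumption based on the parameter names above):

# Poisson distribution: 100 samples of event counts with rate lam = 5.
import numpy as np

samples = np.random.poisson(lam=5, size=100)   # 1x100 array of counts
print(samples)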
Continuous Probability Distributions

A continuous distribution describes the probabilities of a continuous random variable's possible
values. A continuous random variable has an infinite and uncountable set of possible values
(known as its range). The measurement of time is an example of a continuous random variable:
it can take any value from 1 second to 1 billion seconds, and so on.

The probability of a continuous random variable falling within a range is calculated as the area
under the curve of its PDF over that range. As a result, only ranges of values can have a non-zero
probability; the probability that a continuous random variable equals any single value is always zero.

Now, look at some common continuous probability distributions.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is one of the most fundamental
continuous distribution types. This probability distribution is symmetrical around its mean value,
and data close to the mean occur more frequently than data far from it. In the standard case, the
mean is 0 and the variance is a finite value.

In the example below, 100 x-values ranging from 1 to 50 are generated; a function implementing
the normal distribution formula computes the probability density at each point, and the data
points and probability density function are plotted against the X-axis and Y-axis, respectively.
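
A sketch along those lines (the mean and standard deviation are illustrative choices, since the original listing is not reproduced here):

# Normal distribution: hand-written PDF evaluated over 100 x-values from 1 to 50.
import numpy as np
import matplotlib.pyplot as plt

def normal_pdf(x, mean, sd):
    # The normal distribution formula:
    # f(x) = (1 / (sd * sqrt(2*pi))) * exp(-((x - mean)^2) / (2 * sd^2))
    return (1.0 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)

x = np.linspace(1, 50, 100)          # 100 values ranging from 1 to 50
y = normal_pdf(x, mean=25, sd=10)    # illustrative mean and standard deviation

plt.plot(x, y)                       # data points on the X-axis, density on the Y-axis
plt.xlabel("x")
plt.ylabel("Probability density")
plt.show()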
Continuous Uniform Distribution

In a continuous uniform distribution over an interval [a, b], all outcomes are equally possible, so
every value has the same chance of occurring. In this symmetric distribution, the probability
density is constant across the interval, equal to 1/(b - a).

The Python code below is a simple example of a continuous uniform distribution, drawing 1,000
samples of the random variable.
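
A minimal sketch (the interval [0, 1) and numpy.random.uniform are assumptions for illustration):

# Continuous uniform distribution: 1,000 samples on [0, 1).
import numpy as np

samples = np.random.uniform(low=0.0, high=1.0, size=1000)
# For a uniform [a, b) distribution the density is 1/(b - a) and the mean is
# (a + b)/2, so the sample mean should be close to 0.5 here.
print(samples.mean())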

Log-Normal Distribution

This distribution describes random variables whose logarithms follow a normal distribution.
Consider the random variables X and Y: X is log-normally distributed when Y = ln(X) is normally
distributed, where ln denotes the natural logarithm of the X values.

The size distribution of rain droplets can be modeled using a log-normal distribution.
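
A short sketch of this relationship (numpy.random.lognormal is assumed; its mean and sigma parameters describe the underlying normal distribution and are illustrative):

# Log-normal distribution: samples of X whose logarithm Y = ln(X) is normal.
import numpy as np

x = np.random.lognormal(mean=0.0, sigma=0.5, size=1000)
y = np.log(x)                  # recovering Y = ln(X), which is normally distributed
print(y.mean(), y.std())       # should be close to 0.0 and 0.5, respectively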
Exponential Distribution

In a Poisson process, an exponential distribution is a continuous probability distribution that
describes the time between events (success, failure, arrival, etc.).

The example below shows how to draw random samples from an exponential distribution and
return them as a NumPy array using the numpy.random.exponential() method.
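
A minimal sketch (the scale parameter, equal to the mean time between events, is an illustrative choice):

# Exponential distribution: time between events in a Poisson process.
import numpy as np

samples = np.random.exponential(scale=2.0, size=10)   # returns a NumPy array
print(samples)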
What Is Predictive Analytics in Healthcare?

Predictive analytics is a discipline in the data analytics world that relies heavily on modeling,
data mining, AI, and machine learning techniques. It is used to evaluate historical and real-time
data to make predictions about the future.

Predictive analytics in healthcare refers to the analysis of current and historical healthcare data
that allows healthcare professionals to find opportunities to make more effective and more
efficient operational and clinical decisions, predict trends, and even manage the spread of
diseases.

Healthcare data is any data related to the health conditions of an individual or a group of people,
and it is collected from administrative and medical records, health surveys, disease and patient
registries, claims-based datasets, and EHRs. Healthcare analytics is a tool that anyone in the
healthcare industry can use and benefit from to provide better-quality care: healthcare
organizations, hospitals, physicians, psychologists, pharmacists, pharmaceutical companies, and
other healthcare stakeholders.

Use of Predictive Analytics in Healthcare


The healthcare industry generates a tremendous amount of data but struggles to convert that data
into useful insights for improving patient outcomes. Data analytics in healthcare is intended to be
applied to every aspect of patient care and operations management. It is used to investigate
methods of improving patient care, predicting disease outbreaks, reducing the cost of treatment,
and much more. At a business level, with the help of analytics, healthcare organizations can
streamline internal operations, optimize the utilization of their resources, and improve care
teams' coordination and efficiency.

The ability of data analytics to transform raw healthcare data into actionable insights has a
significant impact in the following healthcare areas:

 Clinical research
 Development of new treatments
 Discovery of new drugs
 Prediction and prevention of diseases
 Clinical decision support
 Quicker, more accurate diagnosis of medical conditions
 Higher success rates for surgeries and medications
 Automation of hospital administrative processes
 More accurate calculation of health insurance rates
