Unit 1
While healthcare costs have been constantly rising, the quality of care provided to patients in the United States has not seen considerable improvements. Recently, several researchers have conducted studies showing that incorporating current healthcare technologies can reduce mortality rates, healthcare costs, and medical complications at various hospitals. In 2009, the US government enacted the Health Information Technology for Economic and Clinical Health (HITECH) Act, which includes an incentive program (around $27 billion) for the adoption and meaningful use of Electronic Health Records (EHRs).
Recent advances in information technology have made it increasingly easy to collect various forms of healthcare data. In this digital world, data becomes an integral part of healthcare. A recent report on Big Data estimates the overall potential of healthcare data at around $300 billion [12]. Due to rapid advancements in data sensing and acquisition technologies, hospitals and healthcare institutions have started collecting vast amounts of healthcare data about their patients. Effectively understanding and building knowledge from healthcare data requires developing advanced analytical techniques that can transform data into meaningful and actionable information. General computing technologies have started revolutionizing the manner in which medical care is available to patients. Data analytics, in particular, forms a critical component of these computing technologies. Analytical solutions, when applied to healthcare data, have an immense potential to transform healthcare delivery from reactive to proactive. The impact of analytics in the healthcare domain is only going to grow in the next several years. Analyzing health data allows us to uncover patterns hidden in the data. It also helps clinicians build individualized patient profiles and accurately compute the likelihood that an individual patient will suffer a medical complication in the near future.
Healthcare data is particularly rich; it is derived from a wide variety of sources such as sensors, images, text in the form of biomedical literature/clinical notes, and traditional electronic records. This heterogeneity in the data collection and representation process leads to numerous challenges in both the processing and analysis of the underlying data. There is a wide diversity in the techniques required to analyze these different forms of data. In addition, the heterogeneity of the data naturally creates various data integration and data analysis challenges. In many cases, combining diverse data types yields insights that are not possible from any single source of data. It is only recently that the vast potential of such integrated data analysis methods is being realized.
From a researcher and practitioner perspective, a major challenge in healthcare is its interdisciplinary nature. Advances in the field have often come from diverse disciplines such as databases, data mining, information retrieval, medical research, and healthcare practice. While this interdisciplinary nature adds to the richness of the field, it also adds to the challenges in making significant advances. Computer scientists are usually not trained in domain-specific medical concepts, whereas medical practitioners and researchers have limited exposure to the mathematical and statistical background required in the data analytics area. This has added to the difficulty of creating a coherent body of work in this field, even though it is evident that much of the available data can benefit from such advanced analysis techniques. Such diversity has often led to independent lines of work from completely different perspectives. Researchers in the field of data analytics are particularly susceptible to becoming isolated from real domain-specific problems, and may often propose problem formulations with excellent technique but no practical use. This book is an attempt to bring together these diverse communities by carefully and comprehensively discussing the most relevant contributions from each domain. It is only by bringing together these diverse communities that the vast potential of data analysis methods can be harnessed.
FIGURE 1.1: The overall organization of the book’s contents.
Another major challenge that exists in the healthcare domain is the “data privacy gap” between
medical researchers and computer scientists. Healthcare data is obviously very sensitive because
it can reveal compromising information about individuals. Several laws in various countries,
such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States,
explicitly forbid the release of medical information about individuals for any purpose, unless
safeguards are used to preserve privacy. Medical researchers have natural access to healthcare
data because their research is often paired with an actual medical practice. Furthermore, various
mechanisms exist in the medical domain to conduct research studies with voluntary participants.
Such data collection is almost always paired with anonymity and confidentiality agreements.
On the other hand, acquiring data is not quite as simple for computer scientists without a proper
collaboration with a medical practitioner. Even then, there are barriers in the acquisition of data.
Clearly, many of these challenges can be avoided if accepted protocols, privacy technologies,
and safeguards are in place. Therefore, this book will also address these issues. Figure 1.1
provides an overview of the organization of the book’s contents. This book is organized into
three parts:
1. Healthcare Data Sources and Basic Analytics: This part discusses the details of various healthcare data sources and the basic analytical methods that are widely used in the processing and analysis of such data. The various forms of patient data that are currently being collected in both clinical and non-clinical environments will be studied. Clinical data include structured electronic health records and biomedical images. Sensor data has been receiving a lot of attention recently; techniques for mining sensor data and biomedical signal analysis will be presented. Personalized medicine has gained importance due to advancements in genomic data, and the statistical techniques involved in genomic data analysis will also be elaborated. Patients' in-hospital clinical data also include a large amount of unstructured data in the form of clinical notes. In addition, the domain knowledge that can be extracted by mining the biomedical literature will also be discussed. The fundamental data mining, machine learning, information retrieval, and natural language processing techniques for processing these data types will be extensively discussed. Finally, behavioral data captured through social media will also be discussed.
2. Advanced Data Analytics for Healthcare: This part deals with advanced analytical methods focused on healthcare, including clinical prediction models, temporal data mining methods, and visual analytics. Integrating heterogeneous data, such as clinical and genomic data, is essential for improving predictive power and will also be discussed. Information retrieval techniques that can enhance the quality of biomedical search will be presented. Data privacy is an extremely important concern in healthcare; privacy-preserving data publishing techniques will therefore be presented.
3. Applications and Practical Systems for Healthcare: This part focuses on the practical
applications of data analytics and the systems developed using data analytics for healthcare
and clinical practice. Examples include applications of data analytics to pervasive
healthcare, fraud detection, and drug discovery. In terms of practical systems, we will discuss the details of clinical decision support systems, computer-assisted medical imaging systems, and mobile imaging systems.
These different aspects of healthcare are related to one another. Therefore, the chapters in each of
the aforementioned topics are interconnected. Where necessary, pointers are provided across
different chapters, depending on the underlying relevance. This chapter is organized as follows.
Section 1.2 discusses the main data sources that are commonly used and the basic techniques for
processing them. Section 1.3 discusses advanced techniques in the field of healthcare data
analytics. Section 1.4 discusses a number of applications of healthcare analysis techniques. An
overview of resources in the field of healthcare data analytics is presented in Section 1.5. Section
1.6 presents the conclusions.
Electronic health records (EHRs) contain a digitized version of a patient's medical history. They encompass a full range of data relevant to a patient's care, such as demographics, problems, medications, physician's observations, vital signs, medical history, laboratory data, radiology reports, progress notes, and billing data. Many EHRs go beyond a patient's medical or treatment history and may contain additional broader perspectives of a patient's care. An important property of EHRs is that they provide an effective and efficient way for healthcare providers and organizations to share data with one another. In this context, EHRs are inherently designed to be real time; they can instantly be accessed and edited by authorized users. This can be very useful in practical settings. For example, a hospital or specialist may wish to access the medical records of the primary provider. An electronic health record streamlines the workflow by allowing direct access to the updated records in real time [30]. It can generate a complete record of a patient's clinical encounter, and support other care-related activities such as evidence-based decision support, quality management, and outcomes reporting. The storage and retrieval of health-related data is more efficient using EHRs. They help to improve the quality and convenience of patient care, increase patient participation in the healthcare process, improve the accuracy of diagnoses and health outcomes, and improve care coordination [29]. Various components of EHRs along with the advantages, barriers, and challenges of using EHRs are discussed in Chapter 2.
Sensor data [2] is ubiquitous in the medical domain, both for real-time and for retrospective analysis. Several forms of medical data collection instruments, such as the electrocardiogram (ECG) and the electroencephalogram (EEG), are essentially sensors that collect signals from various parts of the human body [32]. The data collected from these instruments are sometimes used for retrospective analysis, but more often for real-time analysis. Perhaps the most important use-case of real-time analysis is in the context of intensive care units (ICUs) and real-time remote monitoring of patients with specific medical conditions. In all these cases, the volume of data to be processed can be rather large. For example, in an ICU, it is not uncommon for the sensors to receive input from hundreds of data sources, and alarms need to be triggered in real time. Such applications necessitate the use of big-data frameworks and specialized hardware platforms. In remote-monitoring applications, both the real-time events and a long-term analysis of various trends and treatment alternatives are of great interest.
While rapid growth in sensor data offers significant promise to impact healthcare, it also
introduces a data overload challenge. Hence, it becomes extremely important to develop novel
data analytical tools that can process such large volumes of collected data into meaningful and
interpretable knowledge. Such analytical methods will not only allow for better observation of patients' physiological signals and help provide situational awareness at the bedside, but also provide better insights into the inefficiencies in the healthcare system that may be the root cause of surging costs. The research challenges associated with the mining of sensor data in healthcare settings, and the sensor mining applications and systems in both clinical and non-clinical settings, are discussed in Chapter 4.
Biomedical Signal Analysis consists of measuring signals from biological sources, the origin of
which lies in various physiological processes. Examples of such signals include the
electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG),
electroencephalogram (EEG), electrogastrogram (EGG), phonocardiogram (PCG), and so on. The analysis of these signals is vital in diagnosing pathological conditions and in deciding an
appropriate care pathway. The measurement of physiological signals gives some form of
quantitative or relative assessment of the state of the human body. These signals are acquired
from various kinds of sensors and transducers either invasively or non-invasively.
These signals can be either discrete or continuous depending on the kind of care or the severity of a particular pathological condition. The processing and interpretation of physiological signals is challenging due to the low signal-to-noise ratio (SNR) and the interdependency of the physiological systems. The signal data obtained from the corresponding medical instruments can be copiously noisy, and may sometimes require a significant amount of preprocessing. Several signal processing algorithms have been developed that have significantly enhanced the understanding of the physiological processes. A wide variety of methods are used for filtering, noise removal, and compact signal representation [36]. More sophisticated analysis methods, including
dimensionality reduction techniques such as Principal Component Analysis (PCA), Singular
Value Decomposition (SVD), and wavelet transformation have also been widely investigated in
the literature. A broader overview of many of these techniques may also be found in [1, 2]. Time-
series analysis methods are discussed in [37, 40]. Chapter 5 presents an overview of various
signal processing techniques used for processing biomedical signals.
A significant number of diseases are genetic in nature, but the nature of the causality between the genetic markers and the diseases has not been fully established. For example, diabetes is well known to be a genetic disease; however, the full set of genetic markers that make an individual prone to diabetes is unknown. In some other cases, such as the blindness caused by Stargardt disease, the relevant genes are known but all the possible mutations have not been exhaustively isolated. Clearly, a broader understanding of the relationships between various genetic markers, mutations, and disease conditions has significant potential to assist the development of various gene therapies to cure these conditions. A key interest is in understanding what kinds of health-related questions can be addressed through in-silico analysis of genomic data in typical data-driven studies. Moreover, translating genetic discoveries into personalized medicine practice is a highly non-trivial task with many unresolved challenges. For example, the genomic landscapes in complex diseases such as cancers are overwhelmingly complicated, revealing a high order of heterogeneity among different individuals. Solving these issues will fit a major piece of the puzzle and bring the concept of personalized medicine much closer to reality.
Recent advancements made in the biotechnologies have led to the rapid generation of large
volumes of biological and medical information and advanced genomic research. This has also led
to unprecedented opportunities and hopes for genome scale study of challenging problems in life
science. For example, advances in genomic technology made it possible to study the complete
genomic landscape of healthy individuals for complex diseases [16]. Many of these research
directions have already shown promising results in terms of generating new insights into the biology of human disease and predicting the personalized response of an individual to a particular treatment. Also, genetic data are often modeled either as sequences or as networks.
Therefore, the work in this field requires a good understanding of sequence and network mining
techniques. Various data analytics-based solutions are being developed for tackling key research
problems in medicine such as identification of disease biomarkers and therapeutic targets and
prediction of clinical outcome. More details about the fundamental computational algorithms and
bioinformatics tools for genomic data analysis along with genomic data resources are discussed
in Chapter 6.
Much of the information about patients is encoded in the form of clinical notes. These notes are typically stored in an unstructured data format and are the backbone of much of healthcare data. They contain clinical information from the transcription of dictations, direct entry by providers, or the use of speech recognition applications. They are perhaps the richest source of unexploited information. Needless to say, manual encoding of this free-text form of a broad range of clinical information is too costly and time consuming, and in practice it is limited to primary and secondary diagnoses and procedures for billing purposes. Such notes are notoriously challenging to analyze automatically due to the complexity involved in converting free-text clinical narrative into a structured format. The task is hard mainly because of the notes' unstructured nature, heterogeneity, diverse formats, and varying context across different patients and practitioners.
Natural language processing (NLP) and entity extraction play an important part in inferring useful knowledge from large volumes of clinical text and in automatically encoding clinical information in a timely manner [22]. In general, data preprocessing methods are more important in these contexts than the actual mining techniques. The processing of clinical text using NLP methods is more challenging than the processing of other texts due to the ungrammatical nature of short and telegraphic phrases, dictations, shorthand lexicons such as abbreviations and acronyms, and often misspelled clinical terms. All these problems have a direct impact on the various standard NLP tasks such as shallow or full parsing, sentence segmentation, text categorization, etc., thus making clinical text processing highly challenging. A wide range of NLP methods and data mining techniques for extracting information from clinical text are discussed in Chapter 7.
The rapid emergence of various social media resources such as social networking sites,
blogs/microblogs, forums, question answering services, and online communities provides a
wealth of information about public opinion on various aspects of healthcare. Social media data
can be mined for patterns and knowledge that can be leveraged to make useful inferences about
population health and public health monitoring. A significant amount of public health
information can be gleaned from the inputs of various participants at social media sites. Although most individual social media posts and messages contain little informational value, the aggregation of millions of such messages can generate important knowledge [4, 19]. Effectively analyzing these vast amounts of information can significantly reduce the latency of collecting such complex information.
Previous research on social media analytics for healthcare has focused on capturing aggregate
health trends such as outbreaks of infectious diseases, detecting reports of adverse drug
interactions, and improving interventional capabilities for health-related activities. Disease outbreaks are often strongly reflected in the content of social media, and an analysis of the content history provides valuable insights about them. Topic models are
frequently used for high-level analysis of such health-related content. An additional source of
information in social media sites is obtained from online doctor and patient communities. Since
medical conditions recur across different individuals, the online communities provide a valuable
source of knowledge about various medical conditions. A major challenge in social media
analysis is that the data is often unreliable, and therefore the results must be interpreted with
caution. More discussion about the impact of social media analytics in improving healthcare is
given in Chapter 9.
Healthcare data almost always contain time information and it is inconceivable to reason and
mine these data without incorporating the temporal dimension. There are two major sources of
temporal data generated in the healthcare domain. The first is the electronic health records (EHR)
data and the second is the sensor data. Mining the temporal dimension of EHR data is extremely
promising as it may reveal patterns that enable a more precise understanding of disease
manifestation, progression, and response to therapy. Some of the unique characteristics of EHR data (such as heterogeneity, sparsity, high dimensionality, and irregular time intervals) make conventional methods inadequate to handle them. Unlike EHR data, sensor data are usually
represented as numeric time series that are regularly measured in time at a high frequency.
Examples of these data are physiological data obtained by monitoring the patients on a regular
basis and other electrical activity recordings such as electrocardiogram (ECG),
electroencephalogram (EEG), etc. Sensor data for a specific subject are measured over a much
shorter period of time (usually several minutes to several days) compared to the longitudinal
EHR data (usually collected across the entire lifespan of the patient).
Given the different natures of EHR data and sensor data, the choice of appropriate temporal data mining methods for these types of data is often different. EHR data are usually mined using
temporal pattern mining methods, which represent data instances (e.g., patients’ records) as
sequences of discrete events (e.g., diagnosis codes, procedures, etc.) and then try to find and
enumerate statistically relevant patterns that are embedded in the data. On the other hand, sensor
data are often analyzed using signal processing and time-series analysis techniques (e.g., wavelet
transform, independent component analysis, etc.) [37, 40]. Chapter 11 presents a detailed survey
and summarizes the literature on temporal data mining for healthcare data.
The ability to analyze and identify meaningful patterns in multimodal clinical data must be
addressed in order to provide a better understanding of diseases and to identify patterns that
could be affecting the clinical workflow. Visual analytics provides a way to combine the
strengths of human cognition with interactive interfaces and data analytics that can facilitate the
exploration of complex datasets. Visual analytics is a science that involves the integration of
interactive visual interfaces with analytical techniques to develop systems that facilitate
reasoning over, and interpretation of, complex data [23]. Visual analytics is popular in many
aspects of healthcare data analysis because of the wide variety of insights that such an analysis
provides. Due to the rapid increase of health-related information, it becomes critical to build
effective ways of analyzing large amounts of data by leveraging human–computer interaction
and graphical interfaces. In general, providing easily understandable summaries of complex
healthcare data is useful for a human in gaining novel insights.
In the evaluation of many diseases, clinicians are presented with datasets that often contain
hundreds of clinical variables. The multimodal, noisy, heterogeneous, and temporal
characteristics of the clinical data pose significant challenges to the users while synthesizing the
information and obtaining insights from the data [24]. The amount of information being produced by healthcare organizations opens up opportunities to design new interactive interfaces
to explore large-scale databases, to validate clinical data and coding techniques, and to increase
transparency within different departments, hospitals, and organizations. While many of the visual
methods can be directly adopted from the data mining literature [11], a number of methods,
which are specific to the healthcare domain, have also been designed. The popular data visualization techniques used in clinical settings, and the areas in healthcare that benefit from visual analytics, are discussed in detail in Chapter 12.
In the healthcare domain, the definition of privacy is commonly accepted as “a person’s right
and desire to control the disclosure of their personal health information” [25]. Patients’ health-
related data is highly sensitive because of the potentially compromising information about
individual participants. Various forms of data such as disease information or genomic
information may be sensitive for different reasons. To enable research in the field of medicine, it
is often important for medical organizations to be able to share their data with statistical experts.
Sharing personal health information can bring enormous economic benefits. This naturally
leads to concerns about the privacy of individuals being compromised. The data privacy problem
is one of the most important challenges in the field of healthcare data analytics. Most privacy preservation methods reduce the representation accuracy of the data so that the identification of sensitive attributes of an individual becomes difficult. This can be achieved by perturbing the sensitive attribute, perturbing attributes that serve as identification mechanisms, or a combination of the two. Clearly, this process requires a reduction in the accuracy of data representation, so privacy preservation almost always incurs the cost of losing some data utility. The goal of privacy preservation methods is therefore to optimize the trade-off between utility and privacy, ensuring that the utility lost at a given level of privacy is as little as possible.
The major steps in privacy-preserving data publication algorithms [5, 18] are the identification of an appropriate privacy metric and level for a given access setting and data characteristics, the application of one or more privacy-preserving algorithms to achieve the desired privacy level, and post-analysis of the utility of the processed data. These three steps are repeated until the
desired utility and privacy levels are jointly met. Chapter 15 focuses on applying privacy-
preserving algorithms to healthcare data for secondary-use data publishing and interpretation of
the usefulness and implications of the processed data.
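As a toy illustration of perturbing identifying attributes (a sketch over made-up data, not a method from the text), the following Python snippet generalizes age into bands and truncates ZIP codes, in the spirit of k-anonymity-style quasi-identifier generalization:

# Illustrative sketch only: generalize quasi-identifiers (age, ZIP) to reduce
# re-identification risk; the records below are hypothetical.
records = [
    {"age": 34, "zip": "30318", "diagnosis": "diabetes"},
    {"age": 36, "zip": "30309", "diagnosis": "hypertension"},
    {"age": 62, "zip": "30307", "diagnosis": "asthma"},
]

def generalize(record, age_band=10, zip_digits=3):
    low = (record["age"] // age_band) * age_band
    return {
        "age": f"{low}-{low + age_band - 1}",        # e.g., 34 -> "30-39"
        "zip": record["zip"][:zip_digits] + "**",    # e.g., "30318" -> "303**"
        "diagnosis": record["diagnosis"],            # sensitive attribute kept for utility
    }

for r in records:
    print(generalize(r))

Coarser bands improve privacy but lose utility, which is exactly the trade-off described above.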
Pervasive health refers to the process of tracking medical well-being and providing long-term
medical care with the use of advanced technologies such as wearable sensors. For example,
wearable monitors are often used for measuring the long-term effectiveness of various treatment
mechanisms. These methods face a number of challenges, such as knowledge extraction from the large volumes of data collected and real-time processing. However, recent advances in both hardware and software technologies (data analytics in particular) have made such systems a reality, enabling low-cost intelligent health systems embedded within the home and living environments [33].
A wide variety of sensor modalities can be used when developing intelligent health systems,
including wearable and ambient sensors [28]. In the case of wearable sensors, sensors are
attached to the body or woven into garments. For example, 3-axis accelerometers distributed
over an individual’s body can provide information about the orientation and movement of the
corresponding body part. In addition to these advancements in sensing modalities, there has been
an increasing interest in applying analytics techniques to data collected from such equipment.
Several practical healthcare systems have started using analytical solutions. Some examples
include cognitive health monitoring systems based on activity recognition, persuasive systems
for motivating users to change their health and wellness habits, and abnormal health condition
detection systems.
Healthcare fraud has been one of the biggest problems faced by the United States and costs
several billions of dollars every year. With growing healthcare costs, the threat of healthcare
fraud is increasing at an alarming pace. Given the recent scrutiny of the inefficiencies in the US healthcare system, identifying fraud has been at the forefront of efforts to reduce healthcare costs. One could analyze healthcare claims data along different dimensions to identify fraud. The complexity of the healthcare domain, which includes multiple sets of
participants, including healthcare providers, beneficiaries (patients), and insurance companies,
makes the problem of detecting healthcare fraud equally challenging and makes it different from
other domains such as credit card fraud detection and auto insurance fraud detection. In these
other domains, the methods rely on constructing profiles for the users based on the historical data
and they typically monitor deviations in the behavior of the user from the profile [7]. However,
in healthcare fraud, such approaches are not usually applicable, because the users in the
healthcare setting are the beneficiaries, who typically are not the fraud perpetrators. Hence, more
sophisticated analysis is required in the healthcare sector to identify fraud.
Several solutions based on data analytics have been investigated for solving the problem of
healthcare fraud. The primary advantages of data-driven fraud detection are automatic extraction
of fraud patterns and prioritization of suspicious cases [3]. Most of such analysis is performed
with respect to an episode of care, which is essentially the collection of healthcare services provided to a patient for the same health issue.
The cost of successful novel chemistry-based drug development often reaches millions of
dollars, and the time to introduce the drug to market often comes close to a decade [34]. The high
failure rate of drugs during this process makes the trial phases known as the “valley of death.”
Most new compounds fail during the FDA approval process in clinical trials or cause adverse
side effects. Interdisciplinary computational approaches that combine statistics, computer
science, medicine, chemoinformatics, and biology are becoming highly valuable for drug
discovery and development. In the context of pharmaceutical discoveries, data analytics can
potentially limit the search space and provide recommendations to the domain experts for
hypothesis generation and further analysis and experiments.
Data analytics can be used in several stages of drug discovery and development to achieve
different goals. In this domain, one way to categorize data analytical approaches is based on their
application to pre-marketing and post-marketing stages of the drug discovery and development
process. In the pre-marketing stage, data analytics focus on discovery activities such as finding
signals that indicate relations between drugs and targets, drugs and drugs, genes and diseases,
proteins and diseases, and finding biomarkers. In the post-marketing stage, an important
application of data analytics is to find indications of adverse side effects for approved drugs.
These methods provide a list of potential drug side effect associations that can be used for further
studies. Chapter 18 provides more discussion of the applications of data analytics for
pharmaceutical discoveries including drug-target interaction prediction and pharmacovigilance.
Clinical Decision Support Systems (CDSS) are computer systems designed to assist clinicians
with patient-related decision making, such as diagnosis and treatment [6]. CDSS have become a
crucial component in the evaluation and improvement of patient treatment, since they have been shown to improve both patient outcomes and cost of care [35]. They can help minimize analytical errors by notifying the physician of potentially harmful drug interactions, and their diagnostic procedures have been shown to enable more accurate diagnoses. Some of the main advantages of CDSS are their ability to aid decision making and determine optimal treatment strategies, to support general health policies by estimating the clinical and economic outcomes of different treatment methods, and even to estimate treatment outcomes under certain conditions. The main reasons for the success of CDSS are their electronic nature, their seamless integration with clinical workflows, and their provision of decision support at the appropriate time and location. Two particular fields of healthcare
where CDSS have been extremely influential are pharmacy and billing. CDSS can help
pharmacies to look for negative drug interactions and then report them to the corresponding
patient’s ordering professional. In the billing departments, CDSS have been used to devise
treatment plans that provide an optimal balance of patient care and financial expense.
Mobile imaging refers to the application of portable computers such as smartphones or tablet
computers to store, visualize, and process images with and without connections to servers, the
Internet, or the cloud. Today, portable devices provide sufficient computational power for biomedical image processing, and smart devices have been introduced in the operating theater. While many techniques for biomedical image acquisition will always require special equipment, the regular camera is one of the most widely used imaging modalities in hospitals. Mobile technology and smart devices, especially smartphones, allow new and easier ways of imaging at the patient's bedside, and have the potential to become diagnostic tools that can be used by medical professionals. Smartphones usually contain at least one high-resolution camera that can
be used for image formation. Several challenges arise during the acquisition, visualization,
analysis, and management of images in mobile environments.
What Is Probability?
Probability is the branch of mathematics that deals with the study of random events and their
likelihood of occurrence. It provides a way to quantify and reason about uncertainty, making it
an essential tool for decision-making in fields such as statistics, finance, engineering, and
science. The foundation of probability theory is based on the concept of sample spaces, events,
and probabilities, which are used to model and analyze a wide range of real-world phenomena.
Understanding probability is essential for many applications in our daily lives, including
predicting the weather, assessing risk, and analyzing data. There are several types of probability:
Classical probability: This is based on the concept of equally likely outcomes. For example, if you roll a fair die, the probability of rolling a 6 is 1/6, because there are 6 equally likely outcomes (rolling a 1, 2, 3, 4, 5, or 6).
Empirical probability: This is based on the relative frequency of an event occurring in a large
number of trials. For example, if you flip a coin 100 times and it lands heads 60 times, the
empirical probability of the coin landing heads is 60/100, or 3/5.
Subjective probability: This is based on an individual's personal belief or judgment about the
likelihood of an event occurring. It is subjective because it can vary from person to person and is
not necessarily based on any objective measure.
Axiomatic probability: This is a mathematical approach to probability that is based on axioms (or
assumptions) about the probability of events occurring. It allows for the calculation of
probabilities using mathematical techniques and is widely used in the field of statistics.
Conditional probability: This is the probability of an event occurring given that another event has already occurred. For example, if you roll a fair die twice, the probability of rolling a 6 on the second roll given that the first roll was a 4 is still 1/6, because the two rolls are independent and all 6 outcomes (1, 2, 3, 4, 5, or 6) remain equally likely for the second roll.
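To make these notions concrete, here is a minimal Python simulation sketch (an illustration with arbitrarily chosen numbers): it estimates the empirical probability of heads from simulated coin flips, and checks that conditioning on an independent first die roll leaves the probability of a second-roll 6 unchanged.

import random

random.seed(0)  # reproducible illustration

# Empirical probability: relative frequency of heads in 10,000 coin flips
flips = [random.choice(["H", "T"]) for _ in range(10_000)]
print("Empirical P(heads):", flips.count("H") / len(flips))  # close to the classical 1/2

# Conditional probability with independent events: P(second roll is 6 | first roll was 4)
pairs = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(60_000)]
second_given_four = [second for first, second in pairs if first == 4]
print("Empirical P(6 | first was 4):", second_given_four.count(6) / len(second_given_four))  # close to 1/6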
In healthcare analytics, understanding likelihood and odds is essential for making informed
decisions, predicting outcomes, and assessing risks. Let's delve into the concepts of likelihood
and odds, their applications in healthcare analytics, and their significance in improving patient
care and healthcare management.
Diagnostic Accuracy: Likelihood ratios (LR) are important metrics used to assess the accuracy of
diagnostic tests. The likelihood ratio compares the probability of a particular test result in
patients with the target disorder to the probability of the same result in patients without the
disorder.
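As a worked illustration with made-up sensitivity and specificity values, the standard formulas LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity can be computed directly:

# Hypothetical test characteristics, for illustration only
sensitivity = 0.90   # P(positive test | disease present)
specificity = 0.80   # P(negative test | disease absent)

lr_positive = sensitivity / (1 - specificity)   # how much a positive result raises the odds of disease
lr_negative = (1 - sensitivity) / specificity   # how much a negative result lowers the odds of disease

print(f"LR+ = {lr_positive:.2f}")  # 4.50
print(f"LR- = {lr_negative:.2f}")  # 0.12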
Odds Ratios: Odds ratios (OR) are frequently used in medical practice to compare the odds of an
outcome between different groups. For example, researchers may calculate the odds ratio to
compare the odds of developing a particular disease between patients with and without a certain
risk factor.
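For instance, given a hypothetical 2x2 table of exposure versus disease with counts a, b, c, and d, the odds ratio is (a/b) / (c/d) = ad/bc; a minimal sketch:

# Hypothetical 2x2 contingency table (illustrative counts, not real data)
#                 disease   no disease
a, b = 30, 70    # exposed to the risk factor
c, d = 10, 90    # not exposed

odds_exposed = a / b                          # odds of disease among the exposed
odds_unexposed = c / d                        # odds of disease among the unexposed
odds_ratio = odds_exposed / odds_unexposed    # equivalently (a * d) / (b * c)

print(f"OR = {odds_ratio:.2f}")  # 3.86: exposure is associated with roughly 3.9x higher odds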
Comparative Analysis: Odds are useful for conducting comparative analyses in healthcare. By
comparing the odds of different outcomes or interventions, healthcare professionals can assess
their relative effectiveness or risk-benefit profiles.
Likelihood and odds play significant roles in healthcare analytics for several reasons:
Clinical Decision Making: Healthcare providers rely on likelihood and odds to make clinical
decisions, such as diagnostic testing, treatment selection, and risk stratification. Understanding
the likelihood of certain outcomes helps clinicians prioritize interventions and allocate resources
effectively.
Risk Assessment and Management: Likelihood and odds are integral to risk assessment and
management in healthcare. By assessing the likelihood of adverse events or complications,
healthcare organizations can implement preventive measures, monitor high-risk patients closely,
and mitigate potential risks.
Despite their importance, there are challenges and considerations associated with likelihood and
odds in healthcare analytics:
Data Quality: Accurate estimation of likelihood and odds requires high-quality data. Data
inaccuracies, inconsistencies, or biases can affect the reliability of analytics insights and
decision-making.
Interpretation Complexity: Likelihood and odds ratios may be challenging to interpret for non-
specialists. Effective communication of these concepts to healthcare professionals, patients, and
stakeholders is essential for informed decision-making.
Ethical Considerations: There are ethical considerations surrounding the use of likelihood and
odds in healthcare analytics, particularly regarding privacy, consent, and the potential for
algorithmic bias. Ensuring transparency, fairness, and accountability in analytics practices is
crucial.
A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range. This range is bounded by the minimum and maximum possible values, but where a possible value falls on the probability distribution is determined by a number of factors, among them the mean (average), standard deviation, skewness, and kurtosis of the distribution.
A discrete distribution describes the probability of occurrence of each value of a discrete random
variable. The number of spoiled apples out of 6 in your refrigerator can be an example of a
discrete probability distribution.
Each possible value of the discrete random variable can be associated with a non-zero probability
in a discrete probability distribution.
Binomial Distribution
The binomial distribution is a discrete distribution with a finite number of possibilities. When
observing a series of what are known as Bernoulli trials, the binomial distribution emerges. A
Bernoulli trial is a scientific experiment with only two outcomes: success or failure.
Consider a random experiment in which you toss a biased coin six times with a 0.4 chance of getting a head. If 'getting a head' is considered a 'success', the binomial distribution shows the probability of r successes for each value of r.
The binomial random variable represents the number of successes (r) in n consecutive
independent Bernoulli trials.
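A minimal sketch of this coin example, assuming SciPy is available, uses scipy.stats.binom to tabulate the probability of each possible number of heads:

from scipy.stats import binom

n, p = 6, 0.4   # six tosses, 0.4 chance of a head per toss (the example above)

# P(r heads) for every possible r = 0..6
for r in range(n + 1):
    print(f"P(r = {r}) = {binom.pmf(r, n, p):.4f}")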
Bernoulli's Distribution
The Bernoulli distribution is a special case of the binomial distribution in which only one trial is conducted, resulting in a single observation. As a result, the Bernoulli distribution describes
events that have exactly two outcomes.
The experiment's outcome can be a value of 0 or 1; that is, Bernoulli random variables take only the values 0 and 1. A probability mass function (pmf) is used to calculate the probability of each of these values.
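For example, scipy.stats.bernoulli exposes such a pmf (a minimal sketch, assuming SciPy is available and taking p = 0.6 as a hypothetical success probability):

from scipy.stats import bernoulli

p = 0.6  # hypothetical probability of success

# The pmf assigns a probability to each of the two possible outcomes
print("P(X = 0) =", bernoulli.pmf(0, p))  # 0.4
print("P(X = 1) =", bernoulli.pmf(1, p))  # 0.6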
Poisson Distribution
A Poisson distribution is a probability distribution used in statistics to show how many times an
event is likely to happen over a given period of time. To put it another way, it's a count
distribution. Poisson distributions are frequently used to comprehend independent events at a
constant rate over a given time interval. Siméon Denis Poisson, a French mathematician, was the
inspiration for the name.
The Python code below, a minimal sketch using numpy.random.poisson, generates a 1x100 sample for an occurrence rate of 5.
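import numpy as np

# Draw a 1x100 sample of event counts with an average rate (lam) of 5
sample = np.random.poisson(lam=5, size=100)
print(sample)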
Continuous Probability Distributions
The area under the curve of a continuous random variable's PDF is used to calculate its
probability. As a result, only value ranges can have a non-zero probability. A continuous random
variable's probability of equaling some value is always zero.
Normal Distribution
The normal distribution is one of the most basic continuous distribution types; Gaussian distribution is another name for it. This probability distribution is symmetrical around its mean value, and data close to the mean occur more frequently than data far from it. Here, the mean is 0, and the variance is a finite value.
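The example discussed next can be reproduced with a short sketch like the following (an illustrative reconstruction, assuming NumPy and Matplotlib are available): generate 100 random values between 1 and 50, define the normal probability density formula, and plot it.

import numpy as np
import matplotlib.pyplot as plt

# 100 random values between 1 and 50, sorted so the curve plots smoothly
x = np.sort(np.random.uniform(1, 50, 100))

def normal_pdf(x, mean, sd):
    # The normal distribution's probability density formula
    return (1.0 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)

y = normal_pdf(x, mean=x.mean(), sd=x.std())
plt.plot(x, y)   # data points on the X-axis, densities on the Y-axis
plt.show()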
In the example, you generated 100 random variables ranging from 1 to 50. After that, you created a function implementing the normal distribution formula to calculate the probability density function. Then, you plotted the data points and the probability density function against the X-axis and Y-axis, respectively.
Continuous Uniform Distribution
In a continuous uniform distribution over an interval [a, b], all outcomes are equally likely, so each value has the same chance of occurring. The probability density is flat and symmetric, equal to 1/(b-a) across the interval.
The Python code below is a simple sketch of a continuous uniform distribution, drawing 1,000 samples of the random variable with numpy.random.uniform (assuming NumPy and Matplotlib are available).
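import numpy as np
import matplotlib.pyplot as plt

# 1,000 samples drawn uniformly from the assumed interval [0, 1)
samples = np.random.uniform(low=0.0, high=1.0, size=1000)

plt.hist(samples, bins=20)   # histogram is roughly flat: every bin is equally likely
plt.show()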
Log-Normal Distribution
This distribution describes random variables whose logarithms follow a normal distribution. Consider random variables X and Y related by Y = ln(X), where ln denotes the natural logarithm; if Y is normally distributed, then X is log-normally distributed. The size distribution of rain droplets can be plotted using a log-normal distribution.
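A brief sketch, assuming NumPy is available: numpy.random.lognormal draws X directly, so the logarithm of the sample should look normal.

import numpy as np

# X is log-normal: ln(X) is normal with the given mean and sigma
x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

print("sample mean of ln(X):", np.log(x).mean())   # close to 0
print("sample std  of ln(X):", np.log(x).std())    # close to 1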
Exponential Distribution
The example below is a minimal sketch showing how to draw random samples from an exponential distribution, returned as a NumPy array, using the numpy.random.exponential() method.
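import numpy as np

# 10 samples from an exponential distribution with scale 2 (the mean inter-event time)
samples = np.random.exponential(scale=2.0, size=10)
print(samples)   # a NumPy array of non-negative floats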
What Is Predictive Analytics in Healthcare?
Predictive analytics is a discipline in the data analytics world that relies heavily on modeling,
data mining, AI, and machine learning techniques. It is used to evaluate historical and real-time
data to make predictions about the future.
Predictive analytics in healthcare refers to the analysis of current and historical healthcare data
that allows healthcare professionals to find opportunities to make more effective and more
efficient operational and clinical decisions, predict trends, and even manage the spread of
diseases.
Healthcare data is any data related to the health conditions of an individual or a group of people
and is collected from administrative and medical records, health surveys, disease and
patient registries, claims-based datasets, and EHRs. Healthcare analytics is a tool that anyone in the healthcare industry can use and benefit from to provide better-quality care: healthcare organizations, hospitals, physicians, psychologists, pharmacists, pharmaceutical companies, and even healthcare stakeholders.
The ability of data analytics to transform raw healthcare data into actionable insights has a
significant impact in the following healthcare areas:
Clinical research
Development of new treatments
Discovery of new drugs
Prediction and prevention of diseases
Clinical decision support
Quicker, more accurate diagnosis of medical conditions
High success rates of surgeries and medications
Automation of hospital administrative processes
More accurate calculation of health insurance rates