Feature

Bulletin of the American Society for Information Science and Technology August/September 2011 Volume 37, Number 6

Biodiversity Informatics
by P. Bryan Heidorn

EDITOR'S SUMMARY
For millennia, information about biological diversity has been collected as a way to
understand the living world. The information record has evolved from early papyrus to
modern electronic collections, but challenges to information access persist, despite and
partly due to a variety of digital techniques and applications. Information exchange is
hampered by lack of access to previous work, inconsistent naming, changes over time and
insufficient resources to create comprehensive databases supporting federated search
with Darwin Core metadata. With progress in biodiversity informatics, we will see greater
use of DNA barcoding and metagenomic techniques to describe species, remote sensing
tools and geographic information systems to detect and describe species locations and
movements and to identify habitats and environmental conditions. Biodiversity informatics
provides essential scientific knowledge to better understand global ecosystems and to
inform land use and policy decisions.
KEYWORDS
biology
informatics
scientific and technical information

Biodiversity informatics has been around since early civilization,
including the Ebers Egyptian Medical Papyrus dated from 1500
BCE, which was derived from much older texts. In fact, some might
argue that biodiversity informatics began when shamans and healers began
teaching their pupils about the names and uses of plants. It is only the
technology that has changed to answer the key questions: What plants and
animals live in a location, what are they called, how do they live and how do
they relate to humans? Every amateur birder, butterfly collector and nature
lover is actively engaged in biodiversity informatics. Many of them use paper
maps and paper-based dichotomous keys to identify their birds, lepidopterans
or other living things. Some use and contribute to more recently developed
online databases of birdcalls and observation records of fellow enthusiasts.
The remainder of this article will focus on advances in the field over the
past couple of decades.
Biodiversity informatics is a subdiscipline of biological informatics.
Biological informatics comprises the information tools used in all of biology
including everything from biomolecular structure to global ecosystems.
Biodiversity informatics tends to focus more on entire organisms, the
interaction of different organisms and their place in the environment. The
contraction, bioinformatics, has come to refer to molecular biology only, but
many in the field of biodiversity informatics use the term bioinformatics to
refer to their work.

Relying on Literature Standards


P. Bryan Heidorn is the director of the School of Information Resources and Library
Science at the University of Arizona, president of the JRS Biodiversity Foundation and a
previous program officer in the division of biological infrastructure at the National
Science Foundation. He can be reached by email at heidorn<at>email.arizona.edu.

The form of literature on biodiversity has changed in its several thousand
years of history from formularies baked on clay tablets to papyrus scrolls and
finally bound books. Some of these works can be found at the great museums
of the world where natural history and cultural history cross. Biodiversity,
and taxonomy in particular, is unique in that the literature since the time of
Linnaeus is used regularly. A scientist cannot name a new species until she
has searched the literature to ensure that it does not already have a Linnaean
scientific name. Some of the publications are now hundreds of years old and
rare. Some contain now-valuable art depicting plants and animals and are
much sought after by collectors.
This lasting and continuing value has led to efforts to digitize the collections
and make them available in a more cost-effective manner to scientists all over
the world, including scientists and other interested people in the developing
world, who previously did not have access to these publications unless they
traveled to Europe or the United States. The Biodiversity Heritage Library
(www.biodiversitylibrary.org/) is one such effort that has digitized 36
million pages to date. Originally a collaboration between major museum
libraries in the United States and the United Kingdom, the collaboration has
recently expanded to include much of Europe, China and Australia.
While these efforts provide unprecedented access to these materials, there
are many informatics challenges, which include, among others, poor optical
character recognition (OCR) and page-level access. Existing OCR technology
cannot handle the huge variation in fonts in these publications and the multiple
languages sometimes found within the same document. Particularly troublesome
is the identification of scientific names in text that has passed through OCR.
Special-purpose software tools such as TaxonFinder have improved the situation, but
the problem is far from solved. There is no global index to articles, chapters
and taxonomic treatments in these newly digitized documents, so another
area of research is the identification of sections, articles and taxonomic
treatments within the publications so that improved indexes can be constructed.
Standardizing Markup and Extraction. While some people wish to read
biodiversity publications from front to back, more often people simply wish
to find facts in the publications. To address this need, biodiversity informatics
also includes the semantic markup, semantic information extraction and text
fusion from biodiversity materials. Semantic markup and extraction use
software techniques such as machine learning to add semantic tagging,
sometimes in XML, within digital documents to identify relevant facts. The
techniques might be used to identify not only treatment boundaries but also
taxonomy (scientific names) and morphological characteristics in descriptions
such as leaf shape in plants, antennae shapes in beetles, butterflies and
moths or any other morphological characteristic of a plant or animal. This
markup makes it easy to extract the information to make new indexes or
more structured, machine-readable descriptions as discussed next. Text fusion
collects facts from multiple publications to create a more detailed description
of a plant or animal and to help detect inconsistencies in descriptions.
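As a rough illustration of semantic tagging, the Python sketch below wraps candidate Latin binomials in an XML-style tag. The regular expression, the tag name taxonName, the function name and the list of known genera are all assumptions made for this example; they are not taken from TaxonFinder or any markup standard, and real tools rely on name dictionaries and machine learning rather than a capitalization pattern.

```python
import re
from xml.sax.saxutils import escape

# Naive pattern (an assumption for illustration): a capitalized genus
# followed by a lowercase epithet. On its own it also matches ordinary
# phrases such as "The invasive", so a genus list is used as a filter.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s([a-z]{3,})\b")

def tag_names(text, known_genera):
    """Wrap binomials whose genus is in known_genera in <taxonName> tags."""
    def repl(match):
        genus, epithet = match.group(1), match.group(2)
        if genus in known_genera:
            return f"<taxonName>{genus} {epithet}</taxonName>"
        return match.group(0)  # leave non-taxonomic matches untouched
    # Escape XML special characters first, then insert the tags.
    return BINOMIAL.sub(repl, escape(text))

tagged = tag_names("The invasive Rosa multiflora has taken over roadsides.", {"Rosa"})
print(tagged)
```

Once names are tagged this way, an index of every page that mentions a given species can be built by collecting the tagged spans.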
New biodiversity literature is born-digital, but unfortunately most of that
text has structural or presentation markup but not semantic markup, meaning
that it cannot be used for machine-to-machine processing. TaxonX and
taXMLit are two semantic markup standards that are being used by different
projects, but application of the formats is expensive and time consuming, so
an active area of research is the development of tools to facilitate the process.
Introducing Interactive Taxonomic Keys. Biodiversity literature also
contains information to help people identify species. Such identification is
accomplished with morphological descriptions and geographic range but
also with the more structured taxonomic keys. A key identifies for the
reader distinguishing characteristics to differentiate groups of species. For
example if a reader wishes to identify a pine tree in a forest, a key might
instruct the reader to first determine the easily distinguishable characteristic
of number of needles per bundle. One set of species has five needles per
bundle; another set of pines has four needles per bundle and another three
and yet another two. Once the reader has decided the number of needles per
bundle, the key then instructs the reader to look for other characteristics such
as needle length, which will reduce the number of candidate species even
further. Given enough characteristics the reader can identify the species of
pine. This method can be error prone, however, because of the need to have
a fixed order of characteristics.
Computers have made interactive keys possible. These tools are sometimes
called multi-entry keys. Some interactive keys such as Lucid and IntKey
allow characteristics to be identified by the user in any order. Some keys are
tolerant of errors and still suggest identifications even if some characters are
entered incorrectly. Some keys dynamically reorder characteristics to suggest
the next best distinguishing characteristic based on information theory or on
an estimate of how easy the characteristic is for a user to see because it is
obvious. Since there are now many interactive key programs with different
advantages and disadvantages, it is useful to be able to exchange information
among the programs. This exchange is made possible through the Structure
of Descriptive Data (SDD) standard developed by the Taxonomic Databases
Working Group (TDWG), now called Biodiversity Information Standards.
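The idea of a multi-entry key can be sketched in a few lines of Python. The character matrix below is invented for illustration (the taxa are placeholders, not real pine data), and the entropy-based ranking is only one possible reading of "based on information theory"; it is not a description of how Lucid or IntKey work internally.

```python
import math
from collections import Counter

# Hypothetical character matrix: taxon -> {character: state}.
TAXA = {
    "pine A": {"needles_per_bundle": 5, "needle_length": "long"},
    "pine B": {"needles_per_bundle": 5, "needle_length": "short"},
    "pine C": {"needles_per_bundle": 3, "needle_length": "long"},
    "pine D": {"needles_per_bundle": 2, "needle_length": "short"},
}

def candidates(observed):
    """Taxa consistent with the observed characters, entered in any order."""
    return [name for name, chars in TAXA.items()
            if all(chars.get(c) == s for c, s in observed.items())]

def entropy(states):
    """Shannon entropy (bits) of the state distribution among candidates."""
    counts = Counter(states)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def next_best_character(observed):
    """Suggest the unobserved character that splits the candidates most evenly."""
    remaining = candidates(observed)
    unused = {c for name in remaining for c in TAXA[name]} - set(observed)
    if not unused:
        return None
    return max(sorted(unused),
               key=lambda c: entropy(TAXA[n][c] for n in remaining))

print(candidates({"needle_length": "long"}))
print(next_best_character({}))
```

Because the user may start with needle length rather than needle count, the fixed ordering of a printed key is no longer required.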

A Rose by Any Other Name. A rose by any other name leads to confusion.
Dardaigh only means something if we know it is Irish for rose. Roses fall
under the family Rosaceae, and there are over 100 species of rose and perhaps
thousands of varieties and cultivars. The hybrid tea rose of Valentine's Day
is very different from the invasive Rosa multiflora that has taken over many
roadsides and forest clearings. It is not clear what we are talking about unless
we are much more precise with the name. This kind of confusion was why
Linnaeus developed the binomial naming system for living things. No published
list of all named species exists. Given that there are almost two million
named species, many with descriptions, this list would be a very long book.
Unfortunately, many inconsistencies have crept into the naming of species
since the time of Linnaeus. Some species have unknowingly been given
different scientific names by different researchers simply because the
second researcher was unable to find the first reference in the literature.
Sometimes they have been given different names just because scientists
disagree. The names for species and genera have changed over time as we
have learned more about species and their ancestral relationships using
phylogenetics, paleontology and other techniques.
Biodiversity informatics provides tools to attempt to create a digital
version of a complete list of species. There are multiple projects around the
world centered on different taxonomic groups that are collected by Species
2000. The goal of the Species 2000 project is to create a validated checklist
of all of the world's species including plants, animals, fungi and microbes
(www.sp2000.org). Such a list of course excludes the species that remain
unnamed and undescribed, which is the majority of species. Technologies
such as life science identifiers (LSID) and other forms of global unique
identifiers (GUID) are being tested to help untangle the name references as
they change through time.
These lists contain only names. E. O. Wilson has an even more ambitious
vision of a web page of every species on earth. The Encyclopedia of Life
project (www.eol.org) is attempting to do just that by bringing together
digital information about species from many sources. Unfortunately, there
frequently are not standards or institutions for this synthesis of information.
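The role a persistent identifier plays in untangling synonyms can be sketched as a simple lookup. Everything in this Python example is hypothetical: the identifier format, the placeholder names and the function name are invented for illustration and do not follow the LSID syntax or any real checklist.

```python
# Hypothetical identifier -> currently accepted name.
ACCEPTED = {"id:0001": "Aus bus"}

# Every name ever applied to the species points at the same identifier.
NAME_TO_ID = {
    "Aus bus": "id:0001",  # accepted name
    "Xus bus": "id:0001",  # the same species described again by a second author
    "Aus cus": "id:0001",  # an older name no longer in use
}

def resolve(name):
    """Return (identifier, accepted name) for any recorded name, or None."""
    identifier = NAME_TO_ID.get(name)
    if identifier is None:
        return None
    return identifier, ACCEPTED[identifier]

print(resolve("Xus bus"))
```

When the accepted name changes, only the ACCEPTED entry needs to be updated; records that cite the identifier remain valid.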

Accessing Museum Collections


One way to organize the field of biodiversity informatics is to begin in
the field with the original collection event, field being the biologist's term
for the forests, meadows, deserts, lakes, rivers, oceans and even frozen
glaciers where they work. When biologists go to the field they observe and
collect living organisms ranging in size from microbes to whales. Much of
the information about these species is gathered in museums, as we all have
seen on the plaques attached to the dinosaurs in the main display rooms of
museums around the world. Sometimes the specimen itself may be put in the
museum along with information about the specimen, ideally including the
name, date of collection, location, collector and environmental information.
Sometimes it is only the information and not the specimen that makes it to
the museum. Figure 1 is an example of an herbarium specimen. Typical
information includes the name or taxonomy of the item, the location where
it was collected, the date of collection, the name of the collector and
perhaps some information about the habitat where it was collected.

FIGURE 1. An example of an herbarium specimen

While you could not tell it from the public displays of specimens, there
are billions of specimens in museums kept back in the reference collections
where professionals can use them for a large number of tasks discussed
below. These collections are the main representation of the biodiversity of
the planet, and the organization of these collections facilitates their
systematic use.
One of the chief uses of the collection is for the identification of species.
For example, there are about a million species of beetle in the world and
about 400,000 of these are named and in museums. When a scientist finds a
new specimen and is not certain of the name, not being able to memorize
hundreds of thousands of names, the scientist searches the collections for
specimens that have been named by previous entomologists. Interactive
keys do not exist for this range of species. For ease of reference and access,
Linnaeus developed a classification scheme that orders life into larger
similar groups including genus, family, order and other taxonomic levels.
Even with this and other orderings in the collection, specimens are very
difficult to find, so shortly after computers became commercially available,
scientists began constructing museum management systems geared to
biological collections. Modern descendants of these computer systems
allow scientists to search a museum by Linnaean taxonomy, location
collected and many other attributes of the specimen. Some highly functional
custom-built systems still exist including the TROPICOS system at the
Missouri Botanical Garden that is used by thousands of scientists. Most
museum management systems have been replaced by commercial or nonprofit,
professionally developed systems such as KE EMu and Specify.

Facilitating Data Federation


There is a key weakness in systems that only index information in a
single museum. A scientist wanting to find, for example, specimens of
Circellium bacchus (a South African dung beetle) in the past would have
needed to search the databases of each and every museum in the world to
find all specimens of interest. In the 1990s, experimental systems such as
Species Analyst were developed that provided for federated search.
Federated search meant that people could submit one query to multiple
museums all at one time. Systems like Species Analyst were inspired by the
library search federation standard Z39.50. Around 2008 members of the
TDWG developed standards for biological data federation. These standards
included Darwin Core (DwC), named to acknowledge Dublin Core (DC)
that had helped inspire it. Rather than structuring bibliographic information,
like DC, DwC structures biological collections records. DC elements such
as Creator and Date were replaced with terms relating to scientific name,
collection event, location and other pertinent information.
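Federated search can be sketched as one query fanned out to several providers, with the answers merged. In this Python example the two "museums" are in-memory lists standing in for remote databases, and all names and records are hypothetical; a real system would send network requests using a protocol such as those named in the text.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for remote museum databases (hypothetical records).
MUSEUM_A = [{"scientificName": "Circellium bacchus", "catalogNumber": "A-1"}]
MUSEUM_B = [{"scientificName": "Lavia frons", "catalogNumber": "B-7"},
            {"scientificName": "Circellium bacchus", "catalogNumber": "B-9"}]

def make_provider(name, records):
    """Return a search function that mimics one museum's query endpoint."""
    def search(scientific_name):
        return [dict(r, institutionCode=name) for r in records
                if r["scientificName"] == scientific_name]
    return search

PROVIDERS = [make_provider("MuseumA", MUSEUM_A),
             make_provider("MuseumB", MUSEUM_B)]

def federated_search(scientific_name):
    """Send one query to every provider at once and merge the answers."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda p: p(scientific_name), PROVIDERS)
        return [rec for results in result_lists for rec in results]

hits = federated_search("Circellium bacchus")
print([h["catalogNumber"] for h in hits])
```

The merge step only works because every provider returns records with the same field names, which is the problem a shared standard such as Darwin Core addresses.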
DwC can be serialized in XML but also in tabular and other forms. There
are discussions underway within the biodiversity informatics community to
represent this information in RDF for use on the semantic web. DwC data
can be collected in central locations using protocols such as TAPIR. The
Global Biodiversity Information Facility (GBIF) is an international effort
centered in Copenhagen to create one central reference point where
scientists and others can go to search the collections of the museums of the
world. Readers are encouraged to go to www.gbif.org to search for museum
specimen records for their favorite bird, butterfly, plant or Carabid beetle.
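As an illustration of the tabular form, the Python sketch below writes one specimen record to CSV. The column names are Darwin Core terms; the values, including the collector, date and coordinates, are invented for this example and do not describe a real specimen.

```python
import csv
import io

# Field names are Darwin Core terms; the values are hypothetical.
record = {
    "scientificName": "Lavia frons",
    "eventDate": "1975-03-14",
    "country": "Kenya",
    "decimalLatitude": "0.5",
    "decimalLongitude": "36.0",
    "recordedBy": "A. Collector",
    "institutionCode": "EXAMPLE",
    "catalogNumber": "12345",
}

def to_csv(records):
    """Serialize Darwin Core-style records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv([record])
print(csv_text)
```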
While there are hundreds of millions of records in GBIF, it falls far short
of the estimated billions of specimens in museums. The difficulty is that
less than 10% of specimen records are in digital format and a much smaller
percentage of this 10% is available through the Internet in standards such as
DwC. Estimates of the costs for digitizing a single specimen can vary from
$0.50 to several dollars depending on the richness of the digital record. This
cost becomes prohibitively expensive when multiplied by billions of
specimens. Consequently, one area of current biodiversity informatics
research focuses on the digitization process; that is, getting specimen
records from specimens and paper into databases. Methods such as those
used to read addresses of postal mail and other technologies have proven to
be inadequate for the diversity of data on museum specimens.
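The scale of the problem is easy to work out. The short Python calculation below assumes one billion specimens and reads "several dollars" as $5.00; both figures are assumptions chosen for illustration, since the text gives only "billions" and a range starting at $0.50.

```python
def digitization_cost(specimens, cost_per_specimen):
    """Total cost of digitizing a collection at a flat per-specimen rate."""
    return specimens * cost_per_specimen

SPECIMENS = 1_000_000_000  # assumption: one billion specimens

low = digitization_cost(SPECIMENS, 0.50)   # low end quoted in the text
high = digitization_cost(SPECIMENS, 5.00)  # assumed value for "several dollars"
print(f"${low:,.0f} to ${high:,.0f}")
```

Even under these conservative assumptions the total runs from hundreds of millions to billions of dollars, which is why cheaper digitization methods are a research priority.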

Advancing Genome-Based Species Identification


Barcode of Life. Even if we did have full access to the metadata about
museum holdings, only highly trained experts would be able to correctly
identify a beetle by comparing it to the 400,000 known species. Similar
problems exist with all insects and many other forms of life. DNA technology
is helping to solve the species identification problem. Full DNA sequencing
is still very expensive and only a very small set of species has been sequenced,
most frequently the so-called model organisms like mouse and Arabidopsis.
We are still a long way from the tricorder available in Star Trek, but new
techniques are being applied to drastically lower the costs of identifying a
species using DNA to below a dollar per sample using a technique called
DNA barcoding. The International Barcode of Life project (ibol.org) helps
organize efforts to create a library of relatively short (about 600 nucleotide)
DNA sequences that uniquely identify species.
While the creation of the sequences is biochemistry, the management of
the data and statistical classification of the sequences is informatics. The
Barcode of Life Data System (BOLD) now holds about 1.2 million specimen
sequences representing over 100,000 individual named species. Different
groups of organisms might need to use different parts of the DNA, but the
basic idea is the same. When someone has an unknown moth or dung beetle,
they can generate a DNA barcode for the specimen. This is compared to the
database to see if there is a matching sequence. This capability is particularly
useful for understanding the linkages of species that have different forms.
Except for some herculean efforts, which raised caterpillars to adult moths
and butterflies, science did not know which caterpillars developed into which
adult. DNA barcoding made answering this question much easier. The
technique can also help with preserving biodiversity by allowing inspectors
to test if fish being sold in a market are indeed the species on their labels.
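The comparison step can be sketched in Python as a nearest-match search over a reference library. The sequences below are short invented strings (real barcodes are about 600 nucleotides), the species names are placeholders, and the 0.9 identity threshold is an assumption; this is not how BOLD performs identification, only an illustration of the matching idea.

```python
# Toy reference library: species -> barcode (hypothetical sequences).
REFERENCE = {
    "species A": "ACGTACGTACGTACGTACGT",
    "species B": "ACGTTCGTACGAACGTACGT",
    "species C": "TTGTACGGACGTACCTACGA",
}

def identity(seq1, seq2):
    """Fraction of matching positions in two pre-aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)

def best_match(query, threshold=0.9):
    """Return (species, identity) for the closest reference sequence.

    The species is None when even the closest reference falls below the
    (assumed) identity threshold, signalling a possible unknown species.
    """
    species, score = max(
        ((name, identity(query, ref)) for name, ref in REFERENCE.items()),
        key=lambda pair: pair[1],
    )
    return (species, score) if score >= threshold else (None, score)

match, score = best_match("ACGTACGTACGTACGTACGA")
print(match, score)
```

The same lookup serves both uses described above: linking a caterpillar to its adult form and checking a market fish against its label.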
Metagenomic Databases. Biodiversity data collection is not limited to
single organism specimens and observations. New metagenomic techniques
make it possible to use genetic sequences to characterize the diversity of
microbes that exist in environmental conditions but have not been cultured
or characterized beyond the DNA. Work since 2006 indicates that our prior
estimates of microbial diversity were off by at least two orders of magnitude.
New computational methods allow analysis of hundreds of unique species
all mixed into the same environmental sample. The databases needed to
store the newly discovered sequences challenge the limits of digital storage
technology. New techniques are being developed to quantify not just the
number of species in a sample but also the main metabolic pathways of
these species and therefore their functional niche within the environment.
This work has profound consequences for areas such as agriculture,
bioremediation, and soil and ocean carbon dioxide sequestration.

Observing Species through Remote Sensing


For larger organisms biodiversity informatics tools now allow for remote
detection of species. For example, arrays of microphones can be placed in
environments of interest to record the chirping of frogs or birdcalls. Researchers
can review the recordings from anywhere on Earth to identify individual
species. In a subfield of eco-acoustics, computers can be programmed using
machine-learning techniques to automatically recognize individual species
and their relative location over time, greatly expanding the capability of
researchers and land use managers to understand biodiversity.
Informatics tools are also revolutionizing the collection and analysis of
animal behavioral data. The underlying question is where animals live and
what they do while they are there. It is now possible using radio collars to
track the movements of elephants across the African savannas or the migration
paths of whales. When combined with visualization tools the information
can be used to warn farmers of an approaching herd of elephants or to plan
the boundaries of a new nature preserve. Miniaturization allows researchers
to attach transponders to birds and record not only location but also
vocalizations that have never before been heard. Miniaturizing information
technology even further, it is now possible to trace the movements of insects
using RFID (Radio-Frequency Identification).
While RFID allows us to follow ants, sensors on artificial satellites allow
us to get a broader view of biodiversity. Information technology is also
allowing us to study and understand larger sections of the landscape using
remote sensing. For example the Terra and Aqua satellites carry a moderate
resolution imaging spectroradiometer (MODIS) that gathers images in 36
spectral bands. This information can be used to calculate forest cover. When
combined with data from other satellites and plane-based sensors such as
light detection and ranging (LIDAR), it is possible to use computational
technology to estimate canopy heights, forest biomass and even the species of
individual trees. The Google Earth Engine uses massively parallel computation
and remote sensing to map forest cover for large swaths of the earth as well
as many other measurements. Google will donate 10 million CPU hours
over the next two years to make the resources more readily available.

Plotting Species Distribution with Geographic Information Systems

Desktop and server tools such as ArcView continue to be a mainstay of
geographic information system (GIS) mapping of species distributions.
Google tools such as Google Maps, Google Earth and Fusion Tables are
making GIS readily available to a much larger number of scientists and
amateurs who work to plot the distribution of species based on observation
location and museum collections. For example, Figure 2 is a map of the
collection location of specimens of the African Lavia frons, the yellow-winged
bat, in museums that contributed their data through GBIF. In fact,
GBIF can export data for Google Earth. These point location maps can help
us to quickly estimate the distribution of a species or set of species. They
can tell us where ranges overlap or where populations are isolated from one
another.

FIGURE 2. Collection locations of specimens of Lavia frons, the yellow-winged bat found in Africa.
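Point location maps of this kind are commonly exchanged as KML, the format Google Earth reads. The Python sketch below builds a minimal KML document from a list of collection points using only the standard library. The coordinates are invented and the function name is an assumption; this is not GBIF's export code.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def occurrences_to_kml(name, points):
    """Build a KML document with one Placemark per (latitude, longitude) pair."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for lat, lon in points:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

# Hypothetical collection points, not real GBIF records.
kml_text = occurrences_to_kml("Lavia frons", [(0.5, 36.0), (-3.4, 37.1)])
print(kml_text)
```

Saving the returned text to a file with a .kml extension should let a viewer such as Google Earth display the points.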

Predicting Species Distribution with Niche Modeling


Biodiversity informatics includes not only information about biological
items themselves but also about environmental conditions that can impact
biodiversity, such as temperature, rainfall, pH, dissolved oxygen in water
and many other factors. To plot the area in which we guess a species lives
we could simply draw a line around the observed locations. A method called
ecological or environmental niche modeling can provide a potentially more
accurate picture. The observation record is incomplete both because most
museum collections have not been digitized and because scientists simply
have not looked for species in all of the locations where they could be.
Because of these gaps there is no reason to assume that the species might
not also exist in a wider range than current observations suggest. Following
this line of reasoning, perhaps Lavia frons (Figure 2) actually lives a little
further west in the Congo Basin. We might also guess that it does not live in
high altitudes or other conditions unsuitable for its existence. We can use
mathematical and logical models in ecological niche modeling to
automatically and objectively identify environmental or ecological
conditions where the species may exist. In this method, known observation
data and sometimes proven absence data are combined with other
environmental information: altitude where observations were made,
minimum and maximum temperature ranges, amount of rain in different
months of the year, ground cover type at the location of observation and
any number of other factors that might influence the survival of a species.
If the species has never been observed in savannas or on mountaintops,
these can be excluded from the species distribution maps thus giving a more
accurate picture of where the species might live.
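One of the simplest forms of this idea is a rectangular environmental envelope: record the range of each variable at the observed locations, then mark any site inside all of those ranges as potentially suitable. The Python sketch below shows that approach with invented numbers. It is offered as an illustration only, since practical niche models use more variables, absence data and statistical methods.

```python
# Hypothetical environmental values at observed locations.
OBSERVATIONS = [
    {"altitude_m": 400, "annual_rain_mm": 900},
    {"altitude_m": 650, "annual_rain_mm": 1200},
    {"altitude_m": 520, "annual_rain_mm": 1000},
]

def fit_envelope(observations):
    """Record the min and max of each variable across the observed locations."""
    variables = observations[0].keys()
    return {v: (min(o[v] for o in observations),
                max(o[v] for o in observations))
            for v in variables}

def is_suitable(envelope, site):
    """A site is predicted suitable if every variable lies inside the envelope."""
    return all(low <= site[v] <= high for v, (low, high) in envelope.items())

envelope = fit_envelope(OBSERVATIONS)

# Hypothetical map cells to classify.
grid = {
    "lowland cell": {"altitude_m": 500, "annual_rain_mm": 1100},
    "mountaintop cell": {"altitude_m": 3200, "annual_rain_mm": 1000},
}
predicted = {name: is_suitable(envelope, site) for name, site in grid.items()}
print(predicted)
```

The lowland cell was never visited by a collector, yet it is predicted suitable because its conditions fall within the observed ranges, while the mountaintop cell is excluded.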
Similar biodiversity informatics techniques can be applied to predict the
ranges of potentially invasive species. For example, the Burmese Python
(Python molurus bivittatus) is native to Southeast Asia but has been
introduced to southern Florida where it is reproducing in the wild and
spreading. Ecological niche models are used to predict limits to its northern
spread in the United States. Niche modeling can also be used to predict the
impact of climate change on species distribution, allowing a predictive map
of the future distribution of species. For example, the female wolverine
(Gulo gulo luscus) requires deep snows with late spring snowmelt to build
birthing dens. Current observation in northern Montana and climate
prediction models show elimination of these snow conditions even at the
highest elevations. Species distribution models predict extirpation
of the species from the lower 48 states. At the same time changing rain
patterns and temperatures are changing the ranges of mosquitoes that carry
the West Nile virus.

Summary
In the future biodiversity informatics will play a critical role in
understanding biodiversity and ecosystem services of critical importance to
human health and well-being. The evolution of biodiversity informatics tools
is driven by advances in computational power, telecommunications and the
evolution of software and hardware development tools. The development of
biodiversity informatics is also governed by the questions that are being
asked by scientists, land use managers and policy makers. There will be an
expanding need to understand biodiversity at all scales from microbes to
ecosystems. Expanding databases and computational tools will allow us to
better understand microbial diversity and the role of that diversity in
ecosystem services in land, sea and air. We will need more powerful
information tools to track and quantify the distribution and movement of
individual species for the purposes of conservation, mitigation of damage
from invasive species and disease vectors and to increase and secure food
production using disease resistance and other genetic properties of wild
relatives. Biodiversity informatics will help us to understand the linked fates
of some of our most important water resources. For example, the Gulf of
Mexico at the Mississippi River delta, the Chesapeake Bay and Lake Victoria
are all suffering ecosystem and fisheries declines because of siltation and
eutrophication (nutrient-surplus-induced hypoxia). Information gathering,
analysis and use can help scientists, citizens and policy makers make more
informed decisions.
