Bulletin of the American Society for Information Science and Technology August/September 2011 Volume 37, Number 6
Biodiversity Informatics
by P. Bryan Heidorn
EDITOR'S SUMMARY
For millennia, information about biological diversity has been collected as a way to
understand the living world. The information record has evolved from early papyrus to
modern electronic collections, but challenges to information access persist, despite and
partly due to a variety of digital techniques and applications. Information exchange is
hampered by lack of access to previous work, inconsistent naming, changes over time and
insufficient resources to create comprehensive databases supporting federated search
with Darwin Core metadata. With progress in biodiversity informatics, we will see greater
use of DNA barcoding and metagenomic techniques to describe species, remote sensing
tools and geographic information systems to detect and describe species locations and
movements and to identify habitats and environmental conditions. Biodiversity informatics
provides essential scientific knowledge to better understand global ecosystems and to
inform land use and policy decisions.
KEYWORDS
biology
informatics
scientific and technical information
Linnaeus is used regularly. A scientist cannot name a new species until she
has searched the literature to ensure that it does not already have a Linnaean
scientific name. Some of the publications are now hundreds of years old and
rare. Some contain now-valuable art depicting plants and animals and are
much sought after by collectors.
This lasting and continuing value has led to efforts to digitize the collections
and make them available in a more cost-effective manner to scientists all over
the world, including scientists and other interested people in the developing
world, who previously did not have access to these publications unless they
traveled to Europe or the United States. The Biodiversity Heritage Library
(www.biodiversitylibrary.org/) is one such effort that has digitized 36
million pages to date. Originally a collaboration between major museum
libraries in the United States and the United Kingdom, the collaboration has
recently expanded to include much of Europe, China and Australia.
While these efforts provide unprecedented access to these materials, there
are many informatics challenges, which include, among others, poor optical
character recognition (OCR) and page level access. Existing OCR technology
cannot handle the huge variation in fonts in these publications and the multiple
languages sometimes within the same document. Particularly troublesome
is the identification of scientific names since OCR was introduced. Special
purpose software tools such as TaxonFinder have improved the situation, but
the problem is far from solved. There is no global index to articles, chapters
and taxonomic treatments in these newly digitized documents, so another
area of research is the identification of sections, articles and taxonomic
treatments within the publications so that improved indexes can be constructed.
Standardizing Markup and Extraction. While some people wish to read
biodiversity publications from front to back, more often people simply wish
to find facts in the publications. To address this need, biodiversity informatics
also includes the semantic markup, semantic information extraction and text
fusion from biodiversity materials. Semantic markup and extraction use
software techniques such as machine learning to add semantic tagging,
sometimes in XML, within digital documents to identify relevant facts. The
techniques might be used to identify not only treatment boundaries but also
taxonomy (scientific names) and morphological characteristics in descriptions
obvious. Since there are now many interactive key programs with different
advantages and disadvantages, it is useful to be able to exchange information
among the programs. This exchange is made possible through the Structure
of Descriptive Data (SDD) standard developed by the Taxonomic Databases
Working Group (TDWG), now called Biodiversity Information Standards.
These lists contain only names. E. O. Wilson has an even more ambitious
vision of a web page of every species on earth. The Encyclopedia of Life
project (www.eol.org) is attempting to do just that by bringing together
digital information about species from many sources. Unfortunately, there
frequently are not standards or institutions for this synthesis of information.
A Rose by Any Other Name. A rose by any other name leads to confusion.
Dardaigh only means something if we know it is Irish for rose. Roses fall
under the family Rosaceae, and there are over 100 species of rose and perhaps
thousands of varieties and cultivars. The hybrid tea rose of Valentine's Day
is very different from the invasive Rosa multiflora that has taken over many
roadsides and forest clearings. It is not clear what we are talking about unless
we are much more precise with the name. This kind of confusion was why
Linnaeus developed the binomial naming system for living things. No published
list of all named species exists. Given that there are almost two million
named species, many with descriptions, this list would be a very long book.
Unfortunately, many inconsistencies have crept into the naming of species
since the time of Linnaeus. Some species have unknowingly been given
different scientific names by different researchers simply because the
second researcher was unable to find the first reference in the literature.
Sometimes they have been given different names just because scientists
disagree. The names for species and genera have changed over time as we
have learned more about species and their ancestral relationships using
phylogenetics, paleontology and other techniques.
Biodiversity informatics provides tools to attempt to create a digital
version of a complete list of species. There are multiple projects around the
world centered on different taxonomic groups that are collected by Species
2000. The goal of the Species 2000 project is to create a validated checklist
of all of the world's species including plants, animals, fungi and microbes
(www.sp2000.org). Such a list of course excludes the species that remain
unnamed and undescribed, which is the majority of species. Technologies
such as life science identifiers (LSID) and other forms of global unique
identifiers (GUID) are being tested to help untangle the name references as
they change through time.
While you could not tell it from the public displays of specimens, there
are billions of specimens in museums kept back in the reference collections
where professionals can use them for a large number of tasks discussed
below. These collections are the main representation of the biodiversity of
the planet, and the organization of these collections facilitates their
systematic use.
One of the chief uses of the collection is for the identification of species.
For example, there are about a million species of beetle in the world and
about 400,000 of these are named and in museums. When a scientist finds a
new specimen and is not certain of the name, not being able to memorize
hundreds of thousands of names, the scientist searches the collections for
specimens that have been named by previous entomologists. Interactive
keys do not exist for this range of species. For ease of reference and access,
Linnaeus developed a classification scheme that orders life into larger
similar groups including genus, family, order and other taxonomic levels.
Even with this and other orderings in the collection, specimens are very
difficult to find, so shortly after computers became commercially available,
scientists began constructing museum management systems geared to
biological collections. Modern descendants of these computer systems
allow scientists to search a museum by Linnaean taxonomy, location
collected and many other attributes of the specimen. Some highly functional
custom-built systems still exist, including the TROPICOS system at the
Missouri Botanical Garden, which is used by thousands of scientists. Most
museum management systems have been replaced by commercial or nonprofit, professionally developed systems such as KE EMu and SPECIFY.
museums all at one time. Systems like Species Analyst were inspired by the
library search federation standard Z39.50. Around 2008, members of the
TDWG developed standards for biological data federation. These standards
included Darwin Core (DwC), named to acknowledge Dublin Core (DC),
which had helped inspire it. Rather than structuring bibliographic information,
like DC, DwC structures biological collections records. DC elements such
as Creator and Date were replaced with terms relating to scientific name,
collection event, location and other pertinent information.
DwC can be serialized in XML but also in tabular and other forms. There
are discussions underway within the biodiversity informatics community to
represent this information in RDF for use on the semantic web. DwC data
can be collected in central locations using protocols such as TAPIR. The
Global Biodiversity Information Facility (GBIF) is an international effort
centered in Copenhagen to create one central reference point where
scientists and others can go to search the collections of the museums of the
world. Readers are encouraged to go to www.gbif.org to search for museum
specimen records for their favorite bird, butterfly, plant or Carabid beetle.
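As a minimal sketch of the tabular form mentioned above, the following Python parses a small comma-separated table whose column headers are Darwin Core term names (catalogNumber, scientificName, eventDate, country, decimalLatitude, decimalLongitude). The records themselves are invented for illustration, and `load_records` and `find_by_genus` are hypothetical helpers, not part of any GBIF or TDWG software.

```python
import csv
import io

# Invented sample records; only the column headers are real Darwin Core terms.
DWC_TABLE = """catalogNumber,scientificName,eventDate,country,decimalLatitude,decimalLongitude
EX-0001,Rosa multiflora,1998-06-14,United States,40.11,-88.21
EX-0002,Gulo gulo luscus,2004-02-03,United States,48.70,-113.80
EX-0003,Rosa canina,1987-07-22,United Kingdom,51.48,-0.29
"""


def load_records(text):
    """Parse tabular Darwin Core text into a list of dicts keyed by term name."""
    return list(csv.DictReader(io.StringIO(text)))


def find_by_genus(records, genus):
    """Return records whose scientificName begins with the given genus."""
    return [r for r in records if r["scientificName"].split()[0] == genus]


records = load_records(DWC_TABLE)
roses = find_by_genus(records, "Rosa")
for r in roses:
    print(r["catalogNumber"], r["scientificName"], r["country"])
```

Because every provider uses the same term names for its columns, the same search function could in principle run over records pooled from many museums, which is the point of a shared standard.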
While there are hundreds of millions of records in GBIF, it falls far short
of the estimated billions of specimens in museums. The difficulty is that
less than 10% of specimen records are in digital format and a much smaller
percentage of this 10% is available through the Internet in standards such as
DwC. Estimates of the costs for digitizing a single specimen can vary from
$0.50 to several dollars depending on the richness of the digital record. This
cost becomes prohibitively expensive when multiplied by billions of
specimens. Consequently, one area of current biodiversity informatics
research focuses on the digitization process; that is, getting specimen
records from specimens and paper into databases. Methods such as those
used to read addresses of postal mail and other technologies have proven to
be inadequate for the diversity of data on museum specimens.
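The scale of the problem can be sketched with simple arithmetic. The figures below are assumptions chosen only to illustrate the ranges given above: 2 billion specimens stands in for "billions," 10% is the stated upper bound on records already digitized, and $5.00 stands in for "several dollars."

```python
def digitization_cost(total_specimens, fraction_digitized, cost_per_specimen):
    """Estimated cost of digitizing the specimens that are not yet digital."""
    remaining = total_specimens * (1 - fraction_digitized)
    return remaining * cost_per_specimen


TOTAL_SPECIMENS = 2_000_000_000  # assumption: a low reading of "billions"
FRACTION_DIGITIZED = 0.10        # upper bound from the text ("less than 10%")

low = digitization_cost(TOTAL_SPECIMENS, FRACTION_DIGITIZED, 0.50)
high = digitization_cost(TOTAL_SPECIMENS, FRACTION_DIGITIZED, 5.00)  # assumed "several dollars"
print(f"${low:,.0f} to ${high:,.0f}")
```

Under these assumptions the remaining work costs roughly $0.9 billion to $9 billion, which is why lowering the per-specimen cost is a research goal in its own right.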
problems exist with all insects and many other forms of life. DNA technology
is helping to solve the species identification problem. Full DNA sequencing
is still very expensive and only a very small set of species has been sequenced,
most frequently the so-called model organisms like mouse and Arabidopsis.
We are still a long way from the tricorder available in Star Trek, but new
techniques are being applied to drastically lower the costs of identifying a
species using DNA to below a dollar per sample using a technique called
DNA barcoding. The International Barcode of Life project (ibol.org) helps
organize efforts to create a library of relatively short (about 600 nucleotide)
DNA sequences that uniquely identify species.
While the creation of the sequences is biochemistry, the management of
the data and statistical classification of the sequences is informatics. The
Barcode of Life Data System (BOLD) now holds about 1.2 million specimen
sequences representing over 100,000 individual named species. Different
groups of organisms might need to use different parts of the DNA, but the
basic idea is the same. When someone has an unknown moth or dung beetle,
they can generate a DNA barcode for the specimen. This is compared to the
database to see if there is a matching sequence. This capability is particularly
useful for understanding the linkages of species that have different forms.
Except for some herculean efforts, which raised caterpillars to adult moths
and butterflies, science did not know which caterpillars developed into which
adult. DNA barcoding made answering this question much easier. The
technique can also help with preserving biodiversity by allowing inspectors
to test if fish being sold in a market are indeed the species on their labels.
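The comparison step described above can be sketched as a toy example. It scores a query sequence against each reference by the fraction of matching positions and accepts the best match only above a threshold. This is not the method BOLD uses. The sequences are invented 40-nucleotide strings (real barcodes are about 600 nucleotides), they are assumed to be the same length and already aligned, the species labels are placeholders, and the 0.97 threshold is illustrative.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (pre-aligned)")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query, library, threshold=0.97):
    """Return (species, score) for the closest reference.

    The species is None when the best score falls below the threshold.
    """
    species, score = max(
        ((name, identity(query, seq)) for name, seq in library.items()),
        key=lambda pair: pair[1],
    )
    return (species, score) if score >= threshold else (None, score)


# Invented reference "barcodes" with placeholder species labels.
LIBRARY = {
    "Species A": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT",
    "Species B": "ACGTTCGTACCTACGTAGGTACGTACGAACGTACGTTCGT",
    "Species C": "TTGTACGAACGTACCTACGTAGGTACGTACGTTCGTACGA",
}

# Differs from the Species A reference at a single position.
query = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGA"
print(best_match(query, LIBRARY))
```

The same lookup would link a caterpillar to its adult form or check a market fish against its label: only the source of the query sequence changes.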
Metagenomic Databases. Biodiversity data collection is not limited to
single organism specimens and observations. New metagenomic techniques
make it possible to use genetic sequences to characterize the diversity of
microbes that exist in environmental conditions but have not been cultured
or characterized beyond the DNA. Work since 2006 indicates that our prior
estimates of microbial diversity were off by at least two orders of magnitude.
New computational methods allow analysis of hundreds of unique species
all mixed into the same environmental sample. The databases needed to
store the newly discovered sequences challenge the limits of digital storage
technology. New techniques are being developed not just to quantify the
individual trees. The Google Earth engine uses massively parallel computation
and remote sensing to map forest cover for large swaths of the earth as well
as many other measurements. Google will donate 10 million CPU hours
over the next two years to make the resources more readily available.
can tell us where ranges overlap or where populations are isolated from one
another.
spread in the United States. Niche modeling can also be used to predict the
impact of climate change on species distribution, allowing a predictive map
of the future distribution of species. For example, the female wolverine
(Gulo gulo luscus) requires deep snows with late spring snowmelt to build
birthing dens. Current observations in northern Montana and climate
prediction models show elimination of these snow conditions even at the
highest elevations. Species distribution models predict extirpation
of the species from the lower 48 states. At the same time, changing rain
patterns and temperatures are changing the ranges of mosquitoes that carry
the West Nile virus.
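One of the simplest forms of niche modeling, a climate envelope, can be sketched in a few lines. The model records the range of each climate variable at sites where a species has been observed, then asks whether a site's current or projected climate falls inside that range. Real species distribution models are considerably more sophisticated. All values below are invented, and the variable names are hypothetical.

```python
def fit_envelope(occurrences):
    """Per-variable (min, max) range across sites where the species was observed."""
    variables = occurrences[0].keys()
    return {
        v: (min(o[v] for o in occurrences), max(o[v] for o in occurrences))
        for v in variables
    }


def is_suitable(envelope, site):
    """True if every climate variable at the site falls inside the envelope."""
    return all(lo <= site[v] <= hi for v, (lo, hi) in envelope.items())


# Invented climate values at hypothetical observation sites.
occurrences = [
    {"spring_snow_depth_cm": 120, "mean_july_temp_c": 11.0},
    {"spring_snow_depth_cm": 95, "mean_july_temp_c": 13.5},
    {"spring_snow_depth_cm": 150, "mean_july_temp_c": 9.5},
]
envelope = fit_envelope(occurrences)

site_now = {"spring_snow_depth_cm": 110, "mean_july_temp_c": 12.0}
# Hypothetical warming scenario for the same site.
site_projected = {"spring_snow_depth_cm": 40, "mean_july_temp_c": 15.0}

print(is_suitable(envelope, site_now), is_suitable(envelope, site_projected))
```

Applying the suitability test to every cell of a projected climate grid yields the kind of predictive map of future distribution described above.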
Summary
In the future, biodiversity informatics will play a critical role in
understanding biodiversity and ecosystem services of critical importance to
human health and well-being. The evolution of biodiversity informatics tools
is driven by advances in computational power, telecommunications and the
evolution of software and hardware development tools. The development of