Copyright © 2016 W3C ® ( MIT , ERCIM , Keio , Beihang ). W3C liability , trademark and document use rules apply.
This document provides Best Practices related to the publication and usage of data on the Web designed to help support a self-sustaining ecosystem. Data should be discoverable and understandable by humans and machines. Where data is used in some way, whether by the originator of the data or by an external party, such usage should also be discoverable and the efforts of the data publisher recognized. In short, following these Best Practices will facilitate interaction between publishers and consumers.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
The
Working
Group
invites
publishers
to
test
whether
their
datasets
pass
or
fail
has
demonstrated
evidence
that
each
of
the
Best
Practices
best
practices
has
been
recommended
or
adopted
in
this
document
using
the
DWBP
Evidences
Form
.
The
information
gathered
through
this
means
will
be
augmented
by
further
analysis
at
least
two
environments,
such
as
data
portals
and
formal
policies.
Evidence
of
implementation
was
gathered
from
existing
datasets
and
data
available
on
portals
that
already
implement
the
Web
and,
proposed
best
practices,
as
noted
in
the
charter
,
well
as
from
national
or
sector-specific
guidelines
that
reference
the
Best
Practices.
The
Working
Group
expects
to
adduce
DWBP
and
documents
available
on
the
combined
set
of
evidence
when
requesting
that
Web.
Well-known
organizations,
as
well
as
high
profile
datasets
and
data
portals
worldwide
like
DBpedia
,
Data.gov.uk
,
Data.gov
and
the
Director
advance
World
Bank
,
follow
the
Best
Practices
set
out
in
this
document
where
relevant
to
Proposed
Recommendation.
their
specific
operations.
This
document
was
published
by
the
Data
on
the
Web
Best
Practices
Working
Group
as
a
Candidate
Proposed
Recommendation.
This
document
is
intended
to
become
a
W3C
Recommendation.
If
you
wish
The
W3C
Membership
and
other
interested
parties
are
invited
to
make
comments
regarding
this
document,
please
review
the
document
and
send
them
comments
to
public-dwbp-comments@w3.org
(
subscribe
,
archives
).
W3C
publishes
a
Candidate
Recommendation
to
indicate
)
through
15
January
2017.
Advisory
Committee
Representatives
should
consult
their
WBS
questionnaires
.
Note
that
substantive
technical
comments
were
expected
during
the
document
is
believed
to
be
stable
and
to
encourage
implementation
by
the
developer
community.
This
Candidate
Recommendation
is
expected
to
advance
to
Proposed
Recommendation
no
earlier
than
review
period
that
ended
31
October
2016.
All
comments
are
welcome.
Please see the Working Group's implementation report .
Publication
as
a
Candidate
Proposed
Recommendation
does
not
imply
endorsement
by
the
W3C
Membership.
This
is
a
draft
document
and
may
be
updated,
replaced
or
obsoleted
by
other
documents
at
any
time.
It
is
inappropriate
to
cite
this
document
as
other
than
work
in
progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 September 2015 W3C Process Document .
This section is non-normative.
The Best Practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth in online sharing of open data by governments across the world [ OKFN-INDEX ] [ ODB ], the increasing online publication of research data encouraged by organizations like the Research Data Alliance [ RDA ], the harvesting, analysis and online publishing of social media data, crowd-sourcing of information, the increasing presence on the Web of important cultural heritage collections such as at the Bibliothèque nationale de France [ BNF ] and the sustained growth in the Linked Open Data Cloud [ LODC ], provide some examples of this growth in the use of Web for publishing data.
However, this growth is not consistent in style and in many cases does not make use of the full potential of the Open Web Platform's ability to link one fact to another, to discover related resources and to create interactive visualizations.
In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find, use and link to the data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers' efforts may be incompatible with data consumers' desires.
The openness and flexibility of the Web create new challenges for data publishers and data consumers, such as how to represent, describe and make data available in a way that it will be easy to find and to understand. In contrast to conventional databases, for example, where there is a single data model to represent the data and a database management system (DBMS) to control data access, data on the Web allows for the existence of multiple ways to represent and to access data. For more details about the challenges see the section Data on the Web Challenges .
In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed. Such guidance will promote the reuse of data and foster trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.
Not all data and metadata should be shared openly, however. Security, commercial sensitivity and, above all, individuals' privacy need to be taken into account. It is for data publishers to determine policy on which data should be shared and under what circumstances. Data sharing policies are likely to assess the exposure risk and determine the appropriate security measures to be taken to protect sensitive data, such as secure authentication and authorization.
Depending on circumstances, sensitive information about individuals might include full name, home address, email address, national identification number, IP address, vehicle registration plate number, driver's license number, face, fingerprints, or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc. Although it is likely to be safe to share some of that information openly, and even more within a controlled environment, publishers should bear in mind that combining data from multiple sources may allow inadvertent identification of individuals.
A general Best Practice for publishing Data on the Web is to use standards. Different types of organizations specify standards that are specific to the publishing of datasets related to particular domains & applications, involving communities of users interested in that data. These standards define a common way of communicating information among the users of these communities. For example, there are two standards that can be used to publish transport timetables: the General Transit Feed Specification [ GTFS ] and the Service Interface for Real Time Information [ SIRI ]. These specify, in a mixed way, standardized terms, standardized data formats and standardized data access. Another general Best Practice is to use Unicode for handling character and string data. Unicode improves multilingual text processing and makes easier software localization easier. The Best Practices set out in this document serve a general purpose of publishing and using Data on the Web and are domain & application independent. They can be extended or complemented by other Best Practices documents or standards that cover more specialized contexts.
Best Practices cover different aspects related to data publishing and consumption, like data formats, data access, data identifiers and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [ DWBP-UCR ] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases were used to guide the development of the Best Practices.
The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in, for example, Best Practices for Publishing Linked Data [ LD-BP ] since DWBP is domain-independent. Whilst DWBP recommends the use of Linked Data, it also promotes best practices for data on the Web in other open formats such as CSV. Methods for sharing tabular data, including CSV files, in a way that maximizes the potential of the Web to make links between data points, are described in the Tabular Data Primer [ Tabular-Data-Primer ].
In order to encourage data publishers to adopt the DWBP , a number of distinct benefits were identified: comprehension; processability; discoverability; reuse; trust; linkability; access; and interoperability. They are described and related to the Best Practices in the section Best Practices Benefits .
This section is non-normative.
This document sets out Best Practices tailored primarily for those who publish data on the Web. The Best Practices are designed to meet the needs of information management staff, developers, and wider groups such as scientists interested in sharing and reusing research data on the Web. While data publishers are our primary audience, we encourage all those engaged in related activities to become familiar with it. Every attempt has been made to make the document as readable and usable as possible while still retaining the accuracy and clarity needed in a technical specification.
Readers of this document are expected to be familiar with some fundamental concepts of the architecture of the Web [ WEBARCH ], such as resources and URIs, as well as a number of data formats. The normative element of each Best Practice is the intended outcome . Possible implementations are suggested and, where appropriate, these recommend the use of a particular technology. A basic knowledge of vocabularies and data models would be helpful to better understand some aspects of this document.
This section is non-normative.
This document is concerned solely with Best Practices that:
As noted above, whether a Best Practice has or has not been followed should be judged against the intended outcome , not the possible approach to implementation which is offered as guidance. A best practice is always subject to improvement as we learn and evolve the Web together.
This section is non-normative.
The following diagram illustrates the context considered in this document. In general, the Best Practices proposed for publication and usage of Data on the Web refer to datasets and distributions . Data is published in different distributions, which are specific physical form of a dataset. By data, "we mean known facts that can be recorded and that have implicit meaning" [ Navathe ]. These distributions facilitate the sharing of data on a large scale, which allows datasets to be used for several groups of data consumers , without regard to purpose, audience, interest, or license. Given this heterogeneity and the fact that data publishers and data consumers may be unknown to each other, it is necessary to provide some information about the datasets and distributions that may also contribute to trustworthiness and reuse, such as: structural metadata, descriptive metadata, access information, data quality information, provenance information, license information and usage information.
An important aspect of publishing and sharing data on the Web concerns the architectural basis of the Web [ WEBARCH ]. A relevant aspect of this is the identification principle that says that URIs should be used to identify resources. In our context, a resource may be a whole dataset or a specific item of given dataset. All resources should be published with stable URIs, so that they can be referenced and make links, via URIs, between two or more resources. Finally, to promote the interoperability among datasets it is important to adopt data vocabularies and standards.
This section is non-normative.
The following namespace prefixes are used throughout this document.
Prefix | Namespace IRI |
---|---|
dcat | http://www.w3.org/ns/dcat# |
dct | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
foaf | http://xmlns.com/foaf/0.1/ |
oa | http://www.w3.org/ns/oa# |
owl | http://www.w3.org/2002/07/owl# |
pav | http://pav-ontology.github.io/pav/ |
prov | http://www.w3.org/ns/prov# |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
skos | http://www.w3.org/2004/02/skos/core# |
This section presents the template used to describe Data on the Web Best Practices.
Best Practice Template
Short description of the BP
This section answers two crucial questions:
A full text description of the problem addressed by the Best Practice may also be provided. It can be any length but is likely to be no more than a few sentences.
What it should be possible to do when a data publisher follows the Best Practice.
A description of a possible implementation strategy is provided. This represents the best advice available at the time of writing but specific circumstances and future developments may mean that alternative implementation methods are more appropriate to achieve the intended outcome.
Information on how to test the BP has been met. This might or might not be machine testable.
Information about the relevance of the BP. It is described by one or more relevant requirements as documented in the Data on the Web Best Practices Use Cases & Requirements document [ DWBP-UCR ]
This section contains the Best Practices to be used by data publishers in order to help them and data consumers to overcome the different challenges faced when publishing and consuming data on the Web. One or more Best Practices were proposed for each one of the challenges, which are described in the section Data on the Web Challenges .
Each BP is related to one or more requirements from the Data on the Web Best Practices Use Cases & Requirements document [ DWBP-UCR ] which guided their development. Each Best Practice has at least one of these requirements as evidence of its relevance.
RDF examples of the application of some Best Practices are shown using Turtle [ Turtle ] or JSON-LD [ JSON-LD ].
The Web is an open information space, where the absence of a specific context, such a company's internal information system, means that the provision of metadata is a fundamental requirement. Data will not be discoverable or reusable by anyone other than the publisher if insufficient metadata is provided. Metadata provides additional information that helps data consumers better understand the meaning of data, its structure, and to clarify other issues, such as rights and license terms, the organization that generated the data, data quality, data access methods and the update schedule of datasets. Publishers are encouraged to provide human-readable information in multiple languages, and, as much as possible, provide the information in the language(s) that the intended users will understand.
Metadata can be used to help tasks such as dataset discovery and reuse, and can be assigned considering different levels of granularity from a single property of a resource to a whole dataset, or all datasets from a specific organization. Metadata can also be of different types. These types can be classified in different taxonomies, with different grouping criteria. For example, a specific taxonomy could define three metadata types according to descriptive, structural and administrative features. A different taxonomy could define metadata types with a scheme according to tasks where metadata are used, for example, discovery and reuse.
Best Practice 1: Provide metadata
Provide metadata for both human users and computer applications.
Providing metadata is a fundamental requirement when publishing data on the Web because data publishers and data consumers may be unknown to each other. Then, it is essential to provide information that helps human users and computer applications to understand the data as well as other important aspects that describes a dataset or a distribution.
Humans will be able to understand the metadata and computer applications, notably user agents, will be able to process it.
Possible
approaches
to
provide
human
readable
human-readable
metadata:
Possible approaches to provide machine-readable metadata:
Check
if
human
readable
human-readable
metadata
is
available.
Check if the metadata is available in a valid machine-readable format and without syntax error.
Relevant requirements : R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead
Best Practice 2: Provide descriptive metadata
Provide metadata that describes the overall features of datasets and distributions.
Explicitly providing dataset descriptive information allows user agents to automatically discover datasets available on the Web and it allows humans to understand the nature of the dataset and its distributions.
Humans will be able to interpret the nature of the dataset and its distributions, and software agents will be able to automatically discover datasets and distributions.
Descriptive metadata can include the following overall features of a dataset:
Descriptive metadata can include the following overall features of a distribution:
The machine-readable version of the descriptive metadata can be provided using the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [ VOCAB-DCAT ]. This provides a framework in which datasets can be described as abstract entities.
Check if the metadata for the dataset itself includes the overall features of the dataset in a human-readable format.
Check if the descriptive metadata is available in a valid machine-readable format.
Relevant requirements : R-MetadataAvailable , R-MetadataMachineRead , R-MetadataStandardized
Best Practice 3: Provide structural metadata
Provide metadata that describes the schema and internal structure of a distribution.
Providing information about the internal structure of a distribution is essential for others wishing to explore or query the dataset. It also helps people to understand the meaning of the data.
Humans will be able to interpret the schema of a dataset and software agents will be able to automatically process distributions.
Human
readable
Human-readable
structural
metadata
usually
provides
the
properties
or
columns
of
the
dataset
schema.
Machine-readable
structural
metadata
is
available
according
to
the
format
of
a
specific
distribution
and
it
may
be
provided
within
separate
documents
or
embedded
into
the
document.
For
more
details
details,
see
the
links
below.
Check
if
the
structural
metadata
of
the
dataset
is
provided
in
a
human-readable
format.
Check if the metadata of the distribution includes structural information about the dataset in a machine-readable format and without syntax errors.
Relevant requirements : R-MetadataAvailable
A license is a very useful piece of information to be attached to data on the Web. According to the type of license adopted by the publisher, there might be more or fewer restrictions on sharing and reusing data. In the context of data on the Web, the license of a dataset can be specified within the metadata, or outside of it, in a separate document to which it is linked.
Best Practice 4: Provide data license information
Provide a link to or copy of the license agreement that controls use of the data.
The presence of license information is essential for data consumers to assess the usability of data. User agents may use the presence/absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer.
Humans will be able to understand data license information describing possible restrictions placed on the use of a given distribution and software agents to automatically detect the data license of a distribution.
Data license information can be available via a link to, or embedded copy of, a human-readable license agreement. It can also be made available for processing via a link to, or embedded copy of, a machine-readable license agreement.
One of the following vocabularies that include properties for linking to a license can be used:
dct:license
)
cc:license
)
schema:license
)
xhtml:license
)
There are also a number of machine-readable rights languages, including:
Check if the metadata for the dataset itself includes the data license information in a human-readable format.
Check if a user agent can automatically detect /discover the data license of the dataset.
Relevant use cases : R-LicenseAvailable , R-MetadataMachineRead , R-LicenseLiability
The
Web
brings
together
business,
engineering,
and
scientific
communities
creating
collaborative
opportunities
that
were
previously
unimaginable.
The
challenge
in
publishing
data
on
the
Web
is
providing
an
appropriate
level
of
detail
about
its
origin.
The
data
producer
may
not
necessarily
be
the
data
provider
publisher
and
so
collecting
and
conveying
this
corresponding
metadata
is
particularly
important.
Without
provenance
,
consumers
have
no
inherent
way
to
trust
the
integrity
and
credibility
of
the
data
being
shared.
Data
publishers
in
turn
need
to
be
aware
of
the
needs
of
prospective
consumer
communities
to
know
how
much
provenance
detail
is
appropriate.
Best Practice 5: Provide data provenance information
Provide complete information about the origins of the data and any changes you have made.
Provenance is one means by which consumers of a dataset judge its quality. Understanding its origin and history helps one determine whether to trust the data and provides important interpretive context.
Humans
will
know
the
origin
or
and
history
of
the
dataset
and
software
agents
will
be
able
to
automatically
process
provenance
information.
The machine-readable version of the data provenance can be provided using an ontology recommended to describe provenance information, such as W3C 's Provenance Ontology [ PROV-O ].
Check that the metadata for the dataset itself includes the provenance information about the dataset in a human-readable format.
Check if a computer application can automatically process the provenance information about the dataset.
Relevant requirements : R-ProvAvailable , R-MetadataAvailable
The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality information in data publishing and consumption pipelines is of primary importance. Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers. The Data Quality Vocabulary defines concepts such as measures and metrics to assess the quality for each quality dimension [ VOCAB-DQV ]. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.
Best Practice 6: Provide data quality information
Provide information about data quality and fitness for particular purposes.
Data quality might seriously affect the suitability of data for specific applications, including applications very different from the purpose for which it was originally generated. Documenting data quality significantly eases the process of dataset selection, increasing the chances of reuse. Independently from domain-specific peculiarities, the quality of data should be documented and known quality issues should be explicitly stated in metadata.
Humans and software agents will be able to assess the quality and therefore suitability of a dataset for their application.
The machine-readable version of the dataset quality metadata may be provided using the Data Quality Vocabulary developed by the DWBP working group [ VOCAB-DQV ].
Check that the metadata for the dataset itself includes quality information about the dataset.
Check if a computer application can automatically process the quality information about the dataset.
Relevant Requirements: R-QualityMetrics , R-DataMissingIncomplete , R-QualityOpinions
Datasets published on the Web may change over time. Some datasets are updated on a scheduled basis, and other datasets are changed as improvements in collecting the data make updates worthwhile. In order to deal with these changes, new versions of a dataset may be created. Unfortunately, there is no consensus about when changes to a dataset should cause it to be considered a different dataset altogether rather than a new version. In the following, we present some scenarios where most publishers would agree that the revision should be considered a new version of the existing dataset.
In general, multiple datasets that represent time series or spatial series, e.g. the same kind of data for different regions or for different years, are not considered multiple versions of the same dataset. In this case, each dataset covers a different set of observations about the world and should be treated as a new dataset. This is also the case with a dataset that collects data about weekly weather forecasts for a given city, where every week a new dataset is created to store data about that specific week.
Scenarios 1 and 2 might trigger a major version, whereas Scenario 3 would likely trigger only a minor version. But how you decide whether versions are minor or major is less important than that you avoid making changes without incrementing the version indicator. Even for small changes, it is important to keep track of the different dataset versions to make the dataset trustworthy. Publishers should remember that a given dataset may be in use by one or more data consumers, and they should take reasonable steps to inform those consumers when a new version is released. For real-time data, an automated timestamp can serve as a version identifier. For each dataset, the publisher should take a consistent, informative approach to versioning, so data consumers can understand and work with the changing data.
Best Practice 7: Provide a version indicator
Assign and indicate a version number or date for each dataset.
Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ.
Humans and software agents will easily be able to determine which version of a dataset they are working with.
The best method for providing versioning information will vary according to the context; however, there are some basic guidelines that can be followed, for example:
The Web Ontology Language [ OWL2-QUICK-REFERENCE ] and the Provenance, Authoring and versioning Ontology [ PAV ] provide a number of annotation properties for version information.
Check if the metadata for the dataset/distribution provides a unique version number or date in a human-readable format.
Check if a computer application can automatically detect/discover and process the unique version number or date of a dataset or distribution.
Relevant requirements : R-DataVersion
Best Practice 8: Provide version history
Provide a complete version history that explains the changes made in each version.
In creating applications that use data, it can be helpful to understand the variability of that data over time. Interpreting the data is also enhanced by an understanding of its dynamics. Determining how the various versions of a dataset differ from each other is typically very laborious unless a summary of the differences is provided.
Humans and software agents will be able to understand how the dataset typically changes from version to version and how any two specific versions differ.
Provide a list of published versions and a description for each version that explains how it differs from the previous version. An API can expose a version history with a single dedicated URL that retrieves the latest version of the complete history.
Check that a list of published versions is available as well as a change log describing precisely how each version differs from the previous one.
Relevant requirements : R-DataVersion
Identifiers take many forms and are used extensively in every information system. Data discovery, usage and citation on the Web depends fundamentally on the use of HTTP (or HTTPS) URIs: globally unique identifiers that can be looked up by dereferencing them over the Internet [ RFC3986 ]. It is perhaps worth emphasizing some key points about URIs in the current context.
Best Practice 9: Use persistent URIs as identifiers of datasets
Identify each dataset by a carefully chosen, persistent URI.
Adopting a common identification system enables basic data identification and comparison processes by any stakeholder in a reliable way. They are an essential pre-condition for proper data management and reuse.
Developers may build URIs into their code and so it is important that those URIs persist and that they dereference to the same resource over time without the need for human intervention.
Datasets or information about datasets will be discoverable and citable through time, regardless of the status, availability or format of the data.
To be persistent, URIs must be designed as such. A lot has been written on this topic, see, for example, the European Commission's Study on Persistent URIs [ PURI ] which in turn links to many other resources.
Where a data publisher is unable or unwilling to manage a URI space directly for persistence, an alternative approach is to use a redirection service such as Permanent Identifiers for the Web or purl.org . These provide persistent URIs that can be redirected as required so that the eventual location can be ephemeral. The software behind such services is freely available so that it can be installed and managed locally if required.
Digital
Object
Identifiers
(
DOI
s)
offer
a
similar
alternative.
These
identifiers
are
defined
independently
of
any
Web
technology
but
can
be
appended
to
a
'URI
stub.'
DOIs
are
an
important
part
of
the
digital
infrastructure
for
research
data
and
and
libraries.
Check that each dataset is identified using a URI that has been designed for persistence. Ideally the relevant Web site includes a description of the design scheme and a credible pledge of persistence should the publisher no longer be able to maintain the URI space themselves.
Relevant requirements : R-UniqueIdentifier , R-Citable
Best Practice 10: Use persistent URIs as identifiers within datasets
Reuse other people's URIs as identifiers within datasets where possible.
The power of the Web lies in the Network effect . The first telephone only became useful when the second telephone meant there was someone to call; the third telephone made both of them more useful yet. Data becomes more valuable if it refers to other people's data about the same thing, the same place, the same concept, the same event, the same person, and so on. That means using the same identifiers across datasets and making sure that your identifiers can be referred to by other datasets. When those identifiers are HTTP URIs, they can be looked up and more data discovered.
These ideas are at the heart of the 5 Stars of Linked Data where one data point links to another, and of Hypermedia where links may be to further data or to services that can act on or relate to the data in some way.
That's the Web of Data.
Data items will be related across the Web creating a global information space accessible to humans and machines alike.
This is a topic in itself and a general document such as this can only include superficial detail.
Developers
know
that
very
often
the
problem
they're
they
are
trying
to
solve
will
have
already
been
solved
by
other
people.
In
the
same
way,
if
you're
you
are
looking
for
a
set
of
identifiers
for
obvious
things
like
countries,
currencies,
subjects,
species,
proteins,
cities
and
regions,
Nobel
prize
winners
and
products
–
someone's
done
it
already.
The
steps
described
for
discovering
existing
vocabularies
[
LD-BP
]
can
readily
be
adapted.
If
you
can't
can
not
find
an
existing
set
of
identifiers
that
meet
your
needs
then
you'll
you
will
need
to
create
your
own,
following
the
patterns
for
URI
persistence
so
that
others
will
add
value
to
your
data
by
linking
to
it.
Check
that
within
the
dataset,
references
to
things
that
don't
do
not
change
or
that
change
slowly,
such
as
countries,
regions,
organizations
and
people,
are
referred
to
by
URIs
or
by
short
identifiers
that
can
be
appended
to
a
URI
stub.
Ideally
the
URIs
should
resolve,
however,
they
have
value
as
globally
scoped
variables
whether
they
resolve
or
not.
Relevant requirements : R-UniqueIdentifier
Best Practice 11: Assign URIs to dataset versions and series
Assign URIs to individual versions of datasets as well as to the overall series.
Like documents, many datasets fall into natural series or groups. For example:
In different circumstances, it will be appropriate to refer to the current situation (the current set of bus stops, the current elected officials etc.). In others, it may be appropriate to refer to the situation as it existed at a specific time.
Humans and software agents will be able to refer to specific versions of a dataset and to concepts such as a 'dataset series' and 'the latest version'.
The
W3C
provides
a
good
example
of
how
to
do
this.
The
(persistent)
URI
for
this
document
is
http://www.w3.org/TR/2015/WD-dwbp-20150224/.
https://www.w3.org/TR/2015/PR-dwbp-20161215/
.
That
identifier
points
to
an
immutable
snapshot
of
the
document
on
the
day
of
its
publication.
The
URI
for
the
'latest
version'
of
this
document
is
http://www.w3.org/TR/dwbp/
https://www.w3.org/TR/dwbp/
which
is
an
identifier
for
a
series
of
closely
related
documents
that
are
subject
to
change
over
time.
At
the
time
of
publication,
these
two
URIs
both
resolve
to
this
document.
However,
when
the
next
version
of
this
document
is
published,
the
'latest
version'
URI
will
be
changed
to
point
to
that,
but
the
dated
URI
remains
unchanged.
Check that each version of a dataset has its own URI, and that there is also a "latest version" URI.
Relevant requirements : R-UniqueIdentifier , R-Citable
The format in which data is made available to consumers is a key aspect of making that data usable. The best, most flexible access mechanism in the world is pointless unless it serves data in formats that enable use and reuse. Below we detail Best Practices in selecting formats for your data, both at the level of files and that of individual fields. W3C encourages the use of formats that can be used by the widest possible audience and processed most readily by computing systems. Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope. This document is concerned with what is actually published rather than internal systems used to generate the published data.
Best Practice 12: Use machine-readable standardized data formats
Make data available in a machine-readable, standardized data format that is well suited to its intended or potential use.
As data becomes more ubiquitous, and datasets become larger and more complex, processing by computers becomes ever more crucial. Posting data in a format that is not machine-readable places severe limitations on the continuing usefulness of the data. Data becomes useful when it has been processed and transformed into information. Note that there is an important distinction between formats that can be read and edited by humans using a computer and formats that are machine-readable. The latter term implies that the data is readily extracted, transformed and processed by a computer.
Using non-standard data formats is costly and inefficient, and the data may lose meaning as it is transformed. On the other hand, standardized data formats enable interoperability as well as future uses, such as remixing or visualization, many of which cannot be anticipated when the data is first published. It is also important to note that most machine-readable standardized formats are also locale-neutral.
Machines will easily be able to read and process data published on the Web and humans will be able to use computational tools typically available in the relevant domain to work with the data.
Make data available in a machine-readable standardized data format that is easily parseable including but not limited to CSV, XML, HDF5, JSON and RDF serialization syntaxes like RDF/XML, JSON-LD, Turtle.
Check if the data format conforms to a known machine-readable data format specification.
Relevant requirements : R-FormatMachineRead , R-FormatStandardized R-FormatOpen
Best Practice 13: Use locale-neutral data representations
Use locale-neutral data structures and values, or, where that is not possible, provide metadata about the locale used by data values.
Data values that are machine-readable and not specific to any particular language or culture are more durable and less open to misinterpretation than values that use one of the many different cultural representations. Things like dates, currencies and numbers may look similar but have different meanings in different locales. For example, the 'date' 4/7 can be read as 7th of April or the 4th of July depending on where the data was created. Similarly, €2,000 is either two thousand Euros or an over-precise representation of two Euros. By using a locale-neutral format, systems avoid the need to establish specific interchange rules that vary according to the language or location of the user. When the data is already in a locale-specific format, making the locale and language explicit by providing locale parameters allows users to determine how readily they can work with the data and may enable automated translation services.
Humans and software agents will be able to interpret the meaning of strings representing dates, times, currencies and numbers etc. accurately.
Most
common
data
serialization
formats
are
locale-neutral.
For
example,
XML
Schema
types
such
as
xsd:integer
and
xsd:date
are
intended
for
locale-neutral
data
interchange.
Using
locale-neutral
representations
allows
the
data
values
to
be
processed
accurately
without
complex
parsing
or
misinterpretation
and
also
allows
the
data
to
be
presented
in
the
format
most
comfortable
for
the
consumer
of
the
data
in
any
locale.
For
example,
rather
than
storing
"€2000,00"
as
a
string,
it's
it
is
strongly
preferred
to
exchange
a
data
structure
such
as:
…
"price" {
"value": 2000.00,
"currency": "EUR"
}
…
Some datasets contain values that are not or cannot be rendered into a locale-neutral format. This is particularly true of any natural language text values. For each data field that can contain locale-affected or natural-language text, there should be an associated language tag used to indicate the language and locale of the data. This locale information can be used in parsing the data or to ensure proper presentation and processing of the value by the consumer. BCP 47 [ BCP47 ] provides the standard for language and locale identification and, informatively, CLDR [ CLDR ] is the source for both representing specific localized formats and as a reference for specific locale data values.
Check that locale-sensitive data values are represented in a locale-neutral format or that, if this is not possible, relevant locale metadata is provided.
Relevant requirements : R-FormatLocalize , R-MetadataAvailable , R-GeographicalContext , R-FormatMachineRead
Best Practice 14: Provide data in multiple formats
Make data available in multiple formats when more than one format suits its intended or potential use.
Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.
As many users as possible will be able to use the data without first having to transform it into their preferred format.
Consider the data formats most likely to be needed and consider alternatives that are likely to be useful in the future. Data publishers must balance the effort required to make the data available in many formats against the cost of doing so, but providing at least one alternative will greatly increase the usability of the data. In order to serve data in more than one format you can use content negotiation as described in Best Practice Use content negotiation for serving data available in multiple formats.
A word of warning: local identifiers within the dataset, which may be exposed as fragment identifiers in URIs, must be consistent across the various formats.
Check if the complete dataset is available in more than one data format.
Relevant requirements : R-FormatMultiple
Vocabularies define the concepts and relationships (also referred to as “terms” or “attributes”) used to describe and represent an area of interest. They are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several near-synonyms for 'vocabulary' have been coined, for example, ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.
There is no strict division between the artifacts referred to by these names. “Ontology” tends however to denote the vocabularies of classes and properties that structure the descriptions of resources in (linked) datasets. In relational databases, these correspond to the names of tables and columns; in XML, they correspond to the elements defined by an XML Schema. Ontologies are the key building blocks for inference techniques on the Semantic Web. The first means offered by W3C for creating ontologies is the RDF Schema [ RDF-SCHEMA ] language. It is possible to define more expressive ontologies with additional axioms using languages such as those in The Web Ontology Language [ OWL2-OVERVIEW ].
On the other hand, “controlled vocabularies”, “concept schemes” and “knowledge organization systems” enumerate and define resources that can be employed in the descriptions made with the former kind of vocabulary, i.e. vocabularies that structure the descriptions of resources in (linked) datasets. A concept from a thesaurus, say, “architecture”, will for example be used in the subject field for a book description (where “subject” has been defined in an ontology for books). For defining the terms in these vocabularies, complex formalisms are most often not needed. Simpler models have thus been proposed to represent and exchange them, such as the ISO 25964 data model [ ISO-25964 ] or W3C 's Simple Knowledge Organization System [ SKOS-PRIMER ].
Best Practice 15: Reuse vocabularies, preferably standardized ones
Use terms from shared vocabularies, preferably standardized ones, to encode data and metadata.
Use of vocabularies already in use by others captures and facilitates consensus in communities. It increases interoperability and reduces redundancies, thereby encouraging reuse of your own data. In particular, the use of shared vocabularies for metadata (especially structural, provenance, quality and versioning metadata) helps the comparison and automatic processing of both data and metadata. In addition, referring to codes and terms from standards helps to avoid ambiguity and clashes between similar elements or values.
Interoperability and consensus among data publishers and consumers will be enhanced.
The Vocabularies section of the W3C Best Practices for Publishing Linked Data [ LD-BP ] provides guidance on the discovery, evaluation and selection of existing vocabularies.
Organizations
such
as
the
Open
Geospatial
Consortium
(OGC),
ISO
,
W3C
,
WMO
,
libraries
and
research
data
services,
etc.
provide
lists
of
codes,
terminologies
and
Linked
Data
vocabularies
that
can
be
used
by
everyone.
A
key
point
is
to
make
sure
the
dataset,
or
its
documentation,
provides
enough
(human-
and
machine-readable)
context
so
that
data
consumers
can
retrieve
and
exploit
the
standardized
meaning
of
the
values.
In
the
context
of
the
Web,
using
unambiguous,
Web-based
identifiers
(URIs)
for
standardized
vocabulary
resources
is
an
efficient
way
to
do
this.
this,
noting
that
the
same
URI
may
have
multilingual
labels
attached
for
greater
cross-border
interoperability.
The
European
Union's
multilingual
thesaurus,
Eurovoc
provides
a
prime
example.
Using vocabulary repositories like the Linked Open Vocabularies repository or lists of services mentioned in technology-specific Best Practices such as the Best Practices for Publishing Linked Data [ LD-BP ], or the Core Initial Context for RDFa and JSON-LD , check that classes, properties, terms, elements or attributes used to represent a dataset do not replicate those defined by vocabularies used for other datasets.
Check if the terms or codes in the vocabulary to be used are defined in a standards development organization such as IETF, OGC & W3C etc., or are published by a suitable authority, such as a government agency.
Relevant requirements : R-MetadataStandardized , R-MetadataDocum , R-QualityComparable , R-VocabOpen , R-VocabReference
Best Practice 16: Choose the right formalization level
Opt for a level of formal semantics that fits both data and the most likely applications.
As Albert Einstein may or may not have said: everything should be made as simple as possible, but not simpler.
Formal semantics help to establish precise specifications that convey detailed meaning and using a complex vocabulary (ontology) may serve as a basis for tasks such as automated reasoning. On the other hand, such complex vocabularies require more effort to produce and understand, which could hamper their reuse, comparison and linking of datasets that use them.
If the data is sufficiently rich to support detailed research questions (the fact that A, B and C are true, and that D is not true, leads to the conclusion E) then something like an OWL Profile would clearly be appropriate [ OWL2-PROFILES ].
But there is nothing complicated about a list of bus stops.
Choosing a very simple vocabulary is always attractive but there is a danger: the drive for simplicity might lead the publisher to omit some data that provides important information, such as the geographical location of the bus stops that would prevent showing them on a map. Therefore, a balance has to be struck, remembering that the goal is not simply to share your data, but for others to reuse it.
The most likely application cases will be supported with no more complexity than necessary.
Look
at
what
your
peers
do
already.
It's
It
is
likely
you'll
you
will
see
that
there
is
a
commonly
used
vocabulary
that
matches,
or
nearly
matches,
your
current
needs.
That's
That
is
probably
the
one
to
use.
You
may
find
a
vocabulary
that
you'd
like
to
use
but
you
notice
a
semantic
constraint
that
makes
it
difficult
to
do
so,
such
as
a
domain
or
range
restriction
that
doesn't
does
not
apply
to
your
case.
In
that
scenario,
it's
it
is
often
worth
contacting
the
vocabulary
publisher
and
talking
to
them
about
it.
They
may
well
be
able
to
lift
that
restriction
and
provide
further
guidance
on
how
the
vocabulary
is
used
more
broadly.
W3C operates a mailing list at public-vocabs@w3.org [ archive ] where issues around vocabulary usage and development can be discussed.
If
you
are
creating
a
vocabulary
of
your
own,
keep
the
semantic
restrictions
to
the
minimum
that
works
for
you,
again,
so
as
to
increase
the
possibility
of
reuse
by
others.
As
an
example,
the
designers
of
the
(very
widely
used)
SKOS
ontology
itself
have
minimized
its
ontological
commitment
by
questioning
all
formal
axioms
that
were
suggested
for
its
classes
and
properties.
Often
they
were
rejected
because
their
use,
while
beneficial
to
many
applications,
would
have
created
formal
inconsistencies
for
the
data
from
other
applications,
making
SKOS
not
usable
at
all
for
these.
As
an
example,
the
property
skos:broader
was
not
defined
as
a
transitive
property,
even
though
it
would
have
fitted
the
way
hierarchical
links
between
concepts
are
created
for
many
thesauri
[
SKOS-DESIGN
].
Look
for
evidence
of
that
kind
of
"design
for
wide
use"
when
selecting
a
vocabulary.
Another
example
of
this
"design
for
wide
use"
can
be
seen
in
schema.org
.
Launched
in
June
2011,
schema.org
was
massively
adopted
in
a
very
short
time
in
part
because
of
its
informative
rather
than
normative
approach
for
defining
the
types
of
objects
that
properties
can
be
used
with.
For
instance,
the
values
of
the
property
author
are
only
"expected"
to
be
of
type
Organization
or
Person
.
author
"can
be
used"
on
the
type
CreativeWork
but
this
is
not
a
strict
constraint.
Again,
that
approach
to
design
makes
schema.org
a
good
choice
as
a
vocabulary
to
use
when
encoding
data
for
sharing.
This is almost always a matter of subjective judgment with no objective test. As a general guideline:
Relevant requirements : R-VocabReference , R-QualityComparable
Providing easy access to data on the Web enables both humans and machines to take advantage of the benefits of sharing data using the Web infrastructure. By default, the Web offers access using Hypertext Transfer Protocol (HTTP) methods. This provides access to data at an atomic transaction level. This might be through the simple bulk download of a file or, where data is distributed across multiple files or requires more sophisticated retrieval methods, through an API . The two basic methods, bulk download and API , are not mutually exclusive.
In the bulk download approach, data is generally pre-processed server side where multiple files or directory trees of files are provided as one downloadable file. When bulk data is being retrieved from non-file system solutions, depending on the data user communities, the data publisher can offer APIs to support a series of retrieval operations representing a single transaction.
For data that is generated in real time or near real time, data publishers should use an automated system to enable immediate access to time-sensitive data, such as emergency information, weather forecasting data, or system monitoring metrics. In general, APIs should be available to allow third parties to automatically search and retrieve such data.
Aside from helping to automate real-time data pipelines, APIs are suitable for all kinds of data on the Web. Though they generally require more work than posting files for download, publishers are increasingly finding that delivering a well documented, standards-based, stable API is worth the effort.
For some data publishers, it is important to know who has downloaded the data and how they have used it. There are two possible approaches to gathering this information. First, publishers can invite users to provide it, the user's motivation for doing so being that it encourages the continued publication of the data and promotes their own work. A second and less user-friendly approach is to require registration before data is accessed. In both cases, the Dataset Usage Vocabulary [ VOCAB-DUV ] provides a structure for representing such information. When collecting data from users, the publisher should explain why and how information gathered from users (either explicitly or implicitly) will be used. Without a clear policy users might be fearful of providing information and thus the value of the dataset is reduced.
Best Practice 17: Provide bulk download
Enable consumers to retrieve the full dataset with a single request.
When Web data is distributed across many URIs but might logically be organized as one container, accessing the data in bulk can be useful. Bulk access provides a consistent means to handle the data as one dataset. Individually accessing data over many retrievals can be cumbersome and, if used to reassemble the complete dataset, can lead to inconsistent approaches to handling the data.
Large file transfers that would require more time than a typical user would consider reasonable will be possible via dedicated file-transfer protocols.
Depending on the nature of the data and consumer needs, possible approaches could include the following:
The bulk download should include the metadata describing the dataset. Discovery metadata [ VOCAB-DCAT ] should also be available outside the bulk download.
Check if the full dataset can be retrieved with a single request.
Relevant requirements : R-AccessBulk
Best Practice 18: Provide Subsets for Large Datasets
If your dataset is large, enable users and applications to readily work with useful subsets of your data.
Large datasets can be difficult to move from place to place. It can also be inconvenient for users to store or parse a large dataset. Users should not have to download a complete dataset if they only need a subset of it. Moreover, Web applications that tap into large datasets will perform better if their developers can take advantage of “lazy loading”, working with smaller pieces of a whole and pulling in new pieces only as needed. The ability to work with subsets of the data also enables offline processing to work more efficiently. Real-time applications benefit in particular, as they can update more quickly.
Humans and applications will be able to access subsets of a dataset, rather than the entire thing, with a high ratio of needed to unneeded data for the largest number of users. Static datasets that users in the domain would consider to be too large will be downloadable in smaller pieces. APIs will make slices or filtered subsets of the data available, the granularity depending on the needs of the domain and the demands of performance in a Web application.
Consider the expected use cases for your dataset and determine what types of subsets are likely to be most useful. An API is usually the most flexible approach to serving subsets of data, as it allows customization of what data is transferred, making the available subsets much more likely to provide the needed data – and little unneeded data – for any given situation. The granularity should be suitable for Web application access speeds. (An API call that returns within one second enables an application to deliver interactivity that feels natural. Data that takes more than ten seconds to deliver will likely cause users to suspect failure.)
Another way to subset a dataset is to simply split it into smaller units and make those units individually available for download or viewing.
It can also be helpful to mark up a dataset so that individual sections through the data (or even smaller pieces, if expected use cases warrant it) can be processed separately. One way to do that is by indicating “slices” with the RDF Data Cube Vocabulary .
Check that the entire dataset can be recovered by making multiple requests that retrieve smaller units.
Relevant
requirements
:
R-Citable
,
R-GranularityLevels
,
R-UniqueIdentifier
,
R-AccessRealTime
,
R-GranularityLevels
Best Practice 19: Use content negotiation for serving data available in multiple formats
Use content negotiation in addition to file extensions for serving data available in multiple formats.
It is possible to serve data in an HTML page mixed with human-readable and machine-readable data, using RDFa for example. However, as the Architecture of the Web [ WEBARCH ] and DCAT [ VOCAB-DCAT ] make clear, a resource, such as a dataset, can have many representations. The same data might be available as JSON, XML, RDF, CSV and HTML. These multiple representations can be made available via and API but should be made available from the same URL using content negotiation to return the appropriate representation (what DCAT calls a distribution). Specific URIs can be used to identify individual representations of the data directly, by-passing content negotiation.
Content negotiation will enable different resources or different representations of the same resource to be served according to the request made by the client.
A possible approach to implementation is to configure the Web server to deal with content negotiation of the requested resource.
The specific format of the resource's representation can be accessed by the URI or by the Content-type of the HTTP Request.
Check the available representations of the resource and try to get them specifying the accepted content on the HTTP Request header.
Relevant requirements : R-FormatMachineRead , R-FormatMultiple
Best Practice 20: Provide real-time access
When data is produced in real time, make it available on the Web in real time or near real-time.
The presence of real-time data on the Web enables access to critical time sensitive data, and encourages the development of real-time Web applications. Real-time access is dependent on real-time data producers making their data readily available to the data publisher. The necessity of providing real-time access for a given application will need to be evaluated on a case by case basis considering refresh rates, latency introduced by data post processing steps, infrastructure availability, and the data needed by consumers. In addition to making data accessible, data publishers may provide additional information describing data gaps, data errors and anomalies, and publication delays.
Applications will be able to access time-critical data in real time or near real time , where real-time means a range from milliseconds to a few seconds after the data creation.
A
possible
approach
to
implementation
is
for
publishers
to
configure
a
Web
Service
that
provides
a
connection
so
as
real-time
data
is
received
by
the
web
service
Web
Service
it
can
be
instantly
made
available
to
consumers
by
polling
or
streaming.
If data is checked infrequently by consumers, real-time data can be polled upon consumer request for the most recent data through an API . The data publishers will provide an API to facilitate these read-only requests.
If data is checked frequently by consumers, a streaming data implementation may be more appropriate where data is pushed through an API . While streaming techniques are beyond the scope of this best practice, there are many standard protocols and technologies available (for example Server-sent Events, WebSocket, EventSourceAPI) for clients receiving automatic updates from the server.
Relevant requirements : R-AccessRealTime
Best Practice 21: Provide data up to date
Make data available in an up-to-date manner, and make the update frequency explicit.
The availability of data on the Web should closely match the data creation or collection time, perhaps after it has been processed or changed. Carefully synchronizing data publication to the update frequency encourages consumer confidence and data reuse.
Data on the Web will be updated in a timely manner so that the most recent data available online generally reflects the most recent data released via any other channel. When new data becomes available, it will be published on the Web as soon as practical thereafter.
New versions of the dataset can be posted to the Web on a regular schedule, following the Best Practices for Data Versioning . Posting to the Web can be made a part of the release process for new versions of the data. Making Web publication a deliverable item in the process and assigning an individual person as responsible for the task can help prevent data becoming out of date. To set consumer expectations for updates going forward, you can include human-readable text stating the expected publication frequency, and you can provide machine-readable metadata indicating the frequency as well.
Check that the update frequency is stated and that the most recently published copy on the Web is no older than the date predicted by the stated update frequency.
Relevant requirements : R-AccessUptodate
For data that is not available, provide an explanation about how the data can be accessed and who can access it.
Publishing
online
documentation
about
unavailable
data
due
to
sensitivity
issues
provides
a
means
for
publishers
to
explicitly
identify
knowledge
gaps.
This
provides
a
contextual
explanation
for
consumer
communities
thus
encouraging
use
of
the
data
that
is
available.
Consumers will know that data that is referred to from the current dataset is unavailable or only available under different conditions.
Depending
on
the
machine/human
context
there
are
a
variety
of
ways
to
indicate
data
unavailability.
Data
publishers
may
publish
an
HTML
document
that
gives
a
human-readable
explanation
for
data
unavailability.
From
a
machine
application
interface
perspective,
appropriate
HTTP
status
codes
with
customized
human
readable
human-readable
messages
can
be
used.
Examples
of
status
codes
include:
303
(see
other),
410
(permanently
removed),
503
(service
*providing
data*
unavailable).
Where the dataset includes references to data that is no longer available or is not available to all users, check that an explanation of what is missing and instructions for obtaining access (if possible) are given. Check if a legitimate http response code in the 400 or 500 range is returned when trying to get unavailable data.
Relevant requirements : R-AccessLevel , R-SensitivePrivacy , R-SensitiveSecurity
Best Practice 23: Make data available through an API
Offer an API to serve data if you have the resources to do so.
An API offers the greatest flexibility and processability for consumers of your data. It can enable real-time data usage, filtering on request, and the ability to work with the data at an atomic level. If your dataset is large, frequently updated, or highly complex, an API is likely to be the best option for publishing your data.
Developers will have programmatic access to the data for use in their own applications, with data updated without requiring effort on the part of consumers. Web applications will be able to obtain specific data by querying a programmatic interface.
Creating an API is a little more involved than posting data for download. It requires some understanding of how to build a Web application. One need not necessarily to build one from scratch, however. If you use a data management platform, such as CKAN, you may be able to enable an existing API . Many Web development frameworks include support for APIs, and there are also frameworks written specifically for building custom APIs.
Rails,
Django,
Rails
(Ruby),
Django
(Python),
and
Express
(NodeJS)
are
some
example
Web
development
frameworks
that
offer
support
for
building
APIs.
Examples
of
API
frameworks
include
OpenAPI,
Swagger,
Apigility,
Restify,
and
Restlet.
Check if a test client can simulate calls and the API returns the expected responses.
Relevant requirements : R-AccessRealTime , R-AccessUpToDate
Best Practice 24: Use Web Standards as the foundation of APIs
When designing APIs, use an architectural style that is founded on the technologies of the Web itself.
APIs that are built on Web standards leverage the strengths of the Web. For example, using HTTP verbs as methods and URIs that map directly to individual resources helps to avoid tight coupling between requests and responses, making for an API that is easy to maintain and can readily be understood and used by many developers. The statelessness of the Web can be a strength in enabling quick scaling, and using hypermedia enables rich interactions with your API .
Developers who have some experience with APIs based on Web standards, such as REST, will have an initial understanding of how to use the API . The API will also be easier to maintain.
REST (REpresentational State Transfer)[ Fielding ][ Richardson ] is an architectural style that, when used in a Web API , takes advantage of the architecture of the Web itself. A full discussion of how to build a RESTful API is beyond the scope of this document, but there are many resources and a strong community that can help in getting started. There are also many RESTful development frameworks available. If you are already using a Web development framework that supports building REST APIs, consider using that. If not, consider an API -only framework that uses REST.
Another aspect of implementation to consider is making a hypermedia API , one that responds with links as well as data. Links are what make the Web a web, and data APIs can be more useful and usable by including links in their responses. The links can offer additional resources, documentation, and navigation. Even for an API that does not meet all the constraints of REST, returning links in responses can make for a service that is rich and self-documenting.
Check that the service avoids using http as a tunnel for calls to custom methods, and check that URIs do not contain method names.
Relevant requirements : R-APIDocumented , R-UniqueIdentifier
Best Practice 25: Provide complete documentation for your API
Provide complete information on the Web about your API . Update documentation as you add features or make changes.
Developers are the primary consumers of an API and the documentation is the first clue about its quality and usefulness. When API documentation is complete and easy to understand, developers are probably more willing to continue their journey to use it. Providing comprehensive documentation in one place allows developers to code efficiently. Highlighting changes enables your users to take advantage of new features and adapt their code if needed.
Developers will be able to obtain detailed information about each call to the API , including the parameters it takes and what it is expected to return, i.e., the whole set of information related to the API . The set of values — how to use it, notices of recent changes, contact information, and so on — should be described and easily browsable on the Web. It will also enables machines to access the API documentation in order to help developers build API client software.
A
typical
API
reference
provides
a
comprehensive
list
of
the
calls
the
API
can
handle,
describing
the
purpose
of
each
one,
detailing
the
parameters
it
allows
and
what
it
returns,
and
giving
one
or
more
examples
of
its
use.
One
nice
trend
in
API
documentation
is
to
provide
a
form
in
which
developers
can
enter
specific
calls
for
testing,
to
see
what
the
API
returns
for
their
use
case.
There
are
now
tools
available
for
quickly
creating
this
type
of
documentation,
such
as
Swagger
,
io-docs
,
OpenAPI
OpenApis
,
and
others.
It
is
important
to
say
that
the
API
should
be
self-documenting
as
well,
so
that
calls
return
helpful
information
about
errors
and
usage.
API
users
should
be
able
to
contact
the
maintainers
with
questions,
suggestions,
or
bug
reports.
The quality of documentation is also related to usage and feedback from developers. Try to get constant feedback from your users about the documentation.
Check that every call enabled by your API is described in your documentation. Make sure you provide details of what parameters are required or optional and what each call returns.
Check the Time To First Successful Call (i.e. being capable of doing a successful request to the API within a few minutes will increase the chances that the developer will stick to your API ).
Relevant requirements : R-APIDocumented
Best Practice 26: Avoid Breaking Changes to Your API
Avoid changes to your API that break client code, and communicate any changes in your API to your developers when evolution happens.
When developers implement a client for your API , they may rely on specific characteristics that you have built into it, such as the schema or the format of a response. Avoiding breaking changes in your API minimizes breakage to client code. Communicating changes when they do occur enables developers to take advantage of new features and, in the rare case of a breaking change, take action.
Developer code will continue to work. Developers will know of improvements you make and be able to make use of them. Breaking changes to your API will be rare, and if they occur, developers will have sufficient time and information to adapt their code. That will enable them to avoid breakage, enhancing trust. Changes to the API will be announced on the API 's documentation site.
When improving your API , focus on adding new calls or new options rather than changing how existing calls work. Existing clients can ignore such changes and will continue functioning.
If using a fully RESTful style, you should be able to avoid changes that affect developers by keeping resource URIs constant and changing only elements that your users do not code to directly. If you need to change your data in ways that are not compatible with the extension points that you initially designed, then a completely new design is called for, and that means changes that break client code. In that case, it’s best to implement the changes as a new REST API , with a different resource URI.
If using an architectural style that does not allow you to make moderately significant changes without breaking client code, use versioning. Indicate the version in the response header. Version numbers should be reflected in your URIs or in request "accept" headers (using content negotiation). When versioning in URIs, include the version number as far to the left as possible. Keep the previous version available for developers whose code has not yet been adapted to the new version.
To
notify
users
directly
of
changes,
it's
it
is
a
good
idea
to
create
a
mailing
list
and
encourage
developers
to
join.
You
can
then
announce
changes
there,
and
this
provides
a
nice
mechanism
for
feedback
as
well.
It
also
allows
your
users
to
help
each
other.
Release changes initially to a test version of your API before applying them to the production version. Invite developers to test their applications on the test version and provide feedback.
Relevant requirements : R-PersistentIdentification , R-APIDocumented
The working group recognizes that it is unrealistic to assume that all data on the Web will be available on demand at all times into the indefinite future. For a wide variety of reasons, data publishers are likely to want or need to remove data from the live Web, at which point it moves out of scope for the current work and into the scope of data archivists. What is in scope here, however, is what is left behind, that is, what steps should publishers take to indicate that data has been removed or archived. Simply deleting a resource from the Web is bad practice. In that circumstance, dereferencing the URI would lead to an HTTP Response code of 404 that tells the user nothing other than that the resource was not found. The following Best Practices offer more productive approaches.
Best Practice 27: Preserve identifiers
When removing data from the Web, preserve the identifier and provide information about the archived resource.
URI dereferencing is the primary interface to data on the Web. If dereferencing a URI leads to the infamous 404 response code (Not Found), the user will not know whether the lack of availability is permanent or temporary, planned or accidental. If the publisher, or a third party, has archived the data, that archived copy is much less likely to be found if the original URI is effectively broken.
The
URI
of
a
dataset
resource
will
always
dereference
to
the
dataset
resource
or
redirect
to
information
about
it.
There are two scenarios to consider:
In the first of these cases, the server should be configured to respond with an HTTP Response code of 410 (Gone) . From the specification:
The 410 response is primarily intended to assist the task of Web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.
In the second case, where data has been archived, it is more appropriate to redirect requests to a Web page giving information about the archive that holds the data and how a potential user can access it.
In
both
cases,
the
original
URI
continues
to
identify
the
dataset
resource
and
leads
to
useful
information,
even
though
that
dataset
the
data
is
no
longer
directly
available.
Check that dereferencing the URI of a dataset that is no longer available returns information about its current status and availability, using either a 410 or 303 Response Code as appropriate.
Relevant requirements : R-AccessLevel , R-PersistentIdentification
Best Practice 28: Assess dataset coverage
Assess the coverage of a dataset prior to its preservation.
A chunk of Web data is by definition dependent on the rest of the global graph. This global context influences the meaning of the description of the resources found in the dataset. Ideally, the preservation of a particular dataset would involve preserving all its context. That is the entire Web of Data.
At the time of archiving, an evaluation of the linkage of the dataset dump to already preserved resources, and the vocabularies used, needs to be assessed. Datasets for which very few of the vocabularies used and/or resources pointed to are already preserved somewhere should be flagged as being at risk.
Users will be able to make use of archived data well into the future.
Check whether all the resources used are either already preserved somewhere or need to be provided along with the dataset being considered for preservation.
It is impossible to determine what will be available in, say, 50 years' time. However, one can check that an archived dataset depends only on widely used external resources and vocabularies. Check that unique or lesser-used dependencies are preserved as part of the archive.
Relevant requirements : R-VocabReference
Publishing on the Web enables data sharing on a large scale to a wide range of audiences with different levels of expertise. Data publishers want to ensure that the data published is meeting the data consumer needs and for this purpose, user feedback is crucial. Feedback has benefits for both publishers and consumers, helping data publishers to improve the integrity of their published data, as well as encouraging the publication of new data. Feedback allows data consumers to have a voice describing usage experiences (e.g. applications using data), preferences and needs. When possible, feedback should also be publicly available for other data consumers to examine. Making feedback publicly available allows users to become aware of other data consumers, supports a collaborative environment, and allows user community experiences, concerns or questions are currently being addressed.
From a user interface perspective there are different ways to gather feedback from data consumers, including site registration, contact forms, quality ratings selection, surveys and comment boxes for blogging. From a machine perspective the data publisher can also record metrics on data usage or information about specific applications that use the data. Feedback such as this establishes a communication channel between data publishers and data consumers. Publicly available feedback should be displayed in a human-readable form.
This section provides some Best Practices to be followed by data publishers in order to enable consumers to provide feedback. This feedback can be for humans or machines.
Best Practice 29: Gather feedback from data consumers
Provide a readily discoverable means for consumers to offer feedback.
Obtaining feedback helps publishers understand the needs of their data consumers and can help them improve the quality of their published data. It also enhances trust by showing consumers that the publisher cares about addressing their needs. Specifying a clear feedback mechanism removes the barrier of having to search for a way to provide feedback.
Data consumers will be able to provide feedback and ratings about datasets and distributions.
Provide
data
consumers
with
one
or
more
feedback
mechanisms
including,
but
not
limited
to,
a
contact
form,
point
and
click
data
quality
rating
buttons,
or
a
comment
box.
In
order
to
make
the
most
of
feedback
received
from
consumers,
it's
it
is
a
good
idea
to
collect
the
feedback
with
a
tracking
system
that
captures
each
item
in
a
database,
enabling
quantification
and
analysis.
It
is
also
a
good
idea
to
capture
the
type
of
each
item
of
feedback,
i.e.,
its
motivation
(editing,
classifying
[rating],
commenting
or
questioning),
so
that
each
item
can
be
expressed
using
the
Dataset
Usage
Vocabulary
[
VOCAB-DUV
].
Check that at least one feedback mechanism is provided and readily discoverable by data consumers.
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Best Practice 30: Make feedback available
Make consumer feedback about datasets and distributions publicly available.
By sharing feedback with consumers, publishers can demonstrate to users that their concerns are being addressed, and they can avoid submission of duplicate bug reports. Sharing feedback also helps consumers understand any issues that may affect their ability to use the data, and it can foster a sense of community among them.
Consumers will be able to assess the kinds of errors that affect the dataset, review other users' experiences with it, and be reassured that the publisher is actively addressing issues as needed. Consumers will also be able to determine whether other users have already provided similar feedback, saving them the trouble of submitting unnecessary bug reports and sparing the maintainers from having to deal with duplicates.
Feedback can be available as part of an HTML Web page, but it can also be provided in a machine-readable format using the Dataset Usage Vocabulary [ VOCAB-DUV ].
Check that any feedback given by data consumers for a specific dataset or distribution is publicly available.
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise. It is a diverse topic in itself, details of which are beyond the scope of the current document. However, it is worth noting that some of these techniques should be approached with caution, as ethical concerns may arise. In scientific research, care must be taken to avoid enrichment that distorts results or statistical outcomes. For data about individuals, privacy issues may arise when combining datasets. That is, enriching one dataset with another, when neither contains sufficient information about any individual to identify them, may yield a combined dataset that compromises privacy. Furthermore, these techniques can be carried out at scale, which in turn highlights the need for caution.
This section provides some advice to be followed by data publishers in order to enrich data.
Best Practice 31: Enrich data by generating new data
Enrich your data by generating new data when doing so will enhance its value.
Enrichment can greatly enhance processability, particularly for unstructured data. Under some circumstances, missing values can be filled in, and new attributes and measures can be added from the existing raw data. Datasets can also be enriched by gathering additional results in the same fashion as the original data, or by combining the original data with other datasets. Publishing more complete datasets can enhance trust, if done properly and ethically. Deriving additional values that are of general utility saves users time and encourages more kinds of reuse. There are many intelligent techniques that can be used to enrich data, making the dataset an even more valuable asset.
Datasets with missing values will be enhanced by filling in those values. Structure will be conferred and utility enhanced if relevant measures or attributes are added, but only if the addition does not distort analytical results, significance, or statistical power.
Techniques for data enrichment are complex and go well beyond the scope of this document, which can only highlight the possibilities.
Machine learning can readily be applied to the enrichment of data. Methods include those focused on data categorization, disambiguation, entity recognition, sentiment analysis and topification, among others. New data values may be derived as simply as performing a mathematical calculation across existing columns. Other examples include visual inspection to identify features in spatial data and cross-reference to external databases for demographic information. Lastly, generation of new data may be demand-driven, where missing values are calculated or otherwise determined by direct means.
Values generated by inference-based techniques should be labeled as such, and it should be possible to retrieve any original values replaced by enrichment.
Whenever licensing permits, the code used to enrich the data should be made available along with the dataset. Sharing such code is particularly important for scientific data.
Prioritization of enrichment activities should be based on value to the data consumer as well as the effort required. Value to the consumer can be gauged by measurement of demand (e.g., through surveys or tracking direct requests). Documenting how you measure demand can make the increased value demonstrable.
If you make enrichments to someone else’s data, it’s a good idea to offer those enrichments back to the original publisher.
Verify that there are no missing values in the dataset, or additional fields likely to be needed by others, that could readily be provided. Check that any data added by inferential enrichment techniques is identified as such and that any replaced data is still available.
Relevant requirements: R-DataEnrichment , R-FormatMachineRead , R-ProvAvailable
Best Practice 32: Provide Complementary Presentations
Enrich data by presenting it in complementary, immediately informative ways, such as visualizations, tables, Web applications, or summaries.
Data published online is meant to inform others about its subject. But only posting datasets for download or API access puts the burden on consumers to interpret it. The Web offers unparalleled opportunities for presenting data in ways that let users learn and explore without having to create their own tools.
Complementary data presentations will enable human consumers to have immediate insight into the data by presenting it in ways that are readily understood.
One very simple way to provide immediate insight is to publish an analytical summary in an HTML page. Including summative data in graphs or tables can help users scan the summary and quickly understand the meaning of the data.
If you have the means to create interactive visualizations or Web applications that use the data, you can give consumers of your data greater ability to understand it and discover patterns in it. These approaches also demonstrate its suitability for processing and encourage reuse.
Check that the dataset is accompanied by some additional interpretive content that can be perceived without downloading the data or invoking an API .
Relevant requirements: R-DataEnrichment
Reusing
data
is
another
way
of
publishing
data;
it's
it
is
simply
republishing.
It
can
take
the
form
of
combining
existing
data
with
other
datasets,
creating
Web
applications
or
visualizations,
or
repackaging
the
data
in
a
new
form,
such
as
a
translation.
Data
republishers
have
some
responsibilities
that
are
unique
to
that
form
of
publishing
on
the
Web.
This
section
provides
advice
to
be
followed
when
republishing
data.
Best Practice 33: Provide Feedback to the Original Publisher
Let the original publisher know when you are reusing their data. If you find an error or have suggestions or compliments, let them know.
Publishers generally want to know whether the data they publish has been useful. Moreover, they may be required to report usage statistics in order to allocate resources to data publishing activities. Reporting your usage helps them justify putting effort toward data releases. Providing feedback repays the publishers for their efforts by directly helping them to improve their dataset for future users.
Better communication will make it easier for original publishers to determine how the data they post is being used, which in turn helps them justify publishing the data. Publishers will also be made aware of steps they can take to improve their data. This leads to more and better data for everyone.
When you begin using a dataset in a new product, make a note of the publisher’s contact information, the URI of the dataset you used, and the date on which you contacted them. This can be done in comments within your code where the dataset is used. Follow the publisher’s preferred route to provide feedback. If they do not provide a route, look for contact information for the Web site hosting the data.
Check that you have a record of at least one communication informing the publisher of your use of the data.
Relevant requirements: R-TrackDataUsage , R-UsageFeedback , R-QualityOpinions
Best Practice 34: Follow Licensing Terms
Find and follow the licensing requirements from the original publisher of the dataset.
Licensing provides a legal framework for using someone else’s work. By adhering to the original publisher’s requirements, you keep the relationship between yourself and the publisher friendly. You don’t need to worry about legal action from the original publisher if you are following their wishes. Understanding the initial license will help you determine what license to select for your reuse.
Data publishers will be able to trust that their work is being reused in accordance with their licensing requirements, which will make them more likely to continue to publish data. Reusers of data will themselves be able to properly license their derivative works.
Read the original license and adhere to its requirements. If the license calls for specific licensing of derivative works, choose your license to be compatible with that requirement. If no license is given, contact the original publisher and ask what the license is.
Read through the original license and check that your use of the data does not violate any of the terms.
Relevant requirements: R-LicenseAvailable , R-LicenseLiability ,
Best Practice 35: Cite the Original Publication
Acknowledge the source of your data in metadata. If you provide a user interface, include the citation visibly in the interface.
Data is only useful when it is trustworthy. Identifying the source is a major indicator of trustworthiness in two ways: first, the user can judge the trustworthiness of the data from the reputation of the source, and second, citing the source suggests that you yourself are trustworthy as a republisher. In addition to informing the end user, citing helps publishers by crediting their work. Publishers who make data available on the Web deserve acknowledgment and are more likely to continue to share data if they find they are credited. Citation also maintains provenance and helps still others to work with the data.
End users will be able to assess the trustworthiness of the data they see and the efforts of the original publishers will be recognized. The chain of provenance for data on the Web will be traceable back to its original publisher.
You can present the citation to the original source in a user interface by providing bibliographic text and a working link.
Check that the original source of any reused data is cited in the metadata provided. Check that a human-readable citation is readily visible in any user interface.
Relevant requirements: R-Citable , R-ProvAvailable , R-MetadataAvailable , R-TrackDataUsage
This section is non-normative.
A Citation may be either direct and explicit (as in the reference list of a journal article), indirect (e.g. a citation to a more recent paper by the same research group on the same topic), or implicit (e.g. as in artistic quotations or parodies, or in cases of plagiarism).
Data Archiving is the set of practices around the storage and monitoring of the state of digital material over the years.
These tasks are the responsibility of a Trusted Digital Repository (TDR), also sometimes referred to as Long-Term Archive Service (LTA) . Often such services follow the Open Archival Information System [ OAIS ] which defines the archival process in terms of ingest, monitoring and reuse of data.
For the purposes of this WG, a Data Consumer is a person or group accessing, using, and potentially performing post-processing steps on data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
Data Format defined as a specific convention for data representation i.e. the way that information is encoded and stored for use in a computer system, possibly constrained by a formal data type or set of standards."
Data Preservation is defined by the Alliance for Permanent Access Network as "The processes and operations in ensuring the technical and intellectual survival of objects through time". This is part of a data management plan focusing on preservation planning and meta-data . Whether it is worthwhile to put effort into preservation depends on the (future) value of the data, the resources available and the opinion of the designated community of stakeholders.
Data Producer is a person or group responsible for generating and maintaining data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
Provenance originates from the French term "provenir" (to come from), which is used to describe the curation process of artwork as art is passed from owner to owner. Data provenance, in a similar way, is metadata that allows data providers to pass details about the data history to data users.
Data quality is commonly defined as “fitness for use” for a specific application or use case.
A dataset is defined as a collection of data, published or curated by a single agent, and available for access or download in one or more formats. A dataset does not have to be available as a downloadable file.
From: Data Catalog Vocabulary (DCAT) [ VOCAB-DCAT ]
A
distribution
represents
a
specific
available
form
of
a
dataset.
Each
dataset
might
be
available
in
different
forms,
forms;
these
forms
might
represent
different
formats
of
the
dataset
or
different
endpoints.
Examples
of
distributions
include
a
downloadable
CSV
file,
an
API
or
an
RSS
feed
From: Data Catalog Vocabulary (DCAT) [ VOCAB-DCAT ]
A feedback forum is used to collect messages posted by consumers about a particular topic. Messages can include replies to other consumers. Datetime stamps are associated with each message and the messages can be associated with a person or submitted anonymously.
From: Semantically-Interlinked Online Communities ( SIOC ) and the Annotation Model [ Annotation-Model ]
To better understand why an annotation was created, a SKOS Concept Scheme [ SKOS-PRIMER ] is used to show inter-related annotations between communities with more meaningful distinctions than a simple class/subclass tree.
File Format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.
Examples
of
file
formats
include:
plain
text
(in
a
specified
character
encoding,
ideally
UTF-8),
Comma
Separated
Variable
(CSV)
[
RFC4180
],
Portable
Document
Format
(
[
PDF
)
],
[
ISO
32000
],
XML
,
JSON
[
RFC4627
],
Turtle
[
Turtle
]
and
HDF5
.
A license is a legal document giving official permission to do something with the data with which it is associated.
A collection of international preferences, generally related to a language and geographic region that a (certain category) of users require. These are usually identified by a shorthand identifier or token, such as a language tag, that is passed from the environment to various processes to get culturally affected behavior
From Language Tags and Locale Identifiers for the World Wide Web [ LTLI ].
Machine-readable data is data in a standard format that can be read and processed automatically by a computing system. Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret and manipulate. Formats such as XML, JSON, HDF5, RDF and CSV are machine-readable data formats
Adapted from Wikipedia .
The term "near real-time" or "nearly real-time" (NRT), in telecommunications and computing, refers to the time delay introduced, by automated data processing or network transmission, between the occurrence of an event and the use of the processed data, such as for display or feedback and control purposes. For example, a near-real-time display depicts an event or situation as it existed at the current time minus the processing time, as nearly the time of the live event.
From: Wikipedia
Sensitive data is any designated data or metadata that is used in limited ways and/or intended for limited audiences. Sensitive data may include personal data, corporate or government data, and mishandling of published sensitive data may lead to damages to individuals or organizations.
A technical standard is an established norm or requirement in regard to technical systems. It is usually a formal document that establishes uniform engineering or technical criteria, methods, processes and practices. In contrast, a custom, convention, company product, corporate standard, etc. that becomes generally accepted and dominant is often called a de facto standard.
From: Wikipedia
Structured Data refers to data that conforms to a fixed schema. Relational databases and spreadsheets are examples of structured data.
Vocabulary is A collection of "terms" for a particular purpose. Vocabularies can range from simple such as the widely used RDF Schema [ RDF-SCHEMA ], FOAF [ FOAF ] and Dublin Core [ DCTERMS ] to complex vocabularies with thousands of terms, such as those used in healthcare to describe symptoms, diseases and treatments. Vocabularies play a very important role in Linked Data, specifically to help with data integration. The use of this term overlaps with Ontology.
From: Linked Data Glossary
This section is non-normative.
The following diagram summarizes some of the main challenges faced when publishing or consuming data on the Web. These challenges were identified from the DWBP Use Cases and Requirements [ DWBP-UCR ] and, as presented in the diagram, is addressed by one or more Best Practices.
This section is non-normative.
The list below describes the main benefits of applying the DWBP . Each benefit represents an improvement in the way how datasets are available on the Web.
The following table relates Best Practices and Benefits.
The figure below shows the benefits that data publishers will gain with adoption of the Best Practices.
Reuse
All Best Practices
Access
Discoverability
Processability
Trust
Interoperability
Linkability
Comprehension
This section is non-normative.
The editors gratefully acknowledge the contributions made to this document by all members of the working group. Especially Annette Greiner's great effort and the contributions received from Antoine Isaac, Eric Stephan and Phil Archer.
This document has benefited from inputs from many members of the Spatial Data on the Web Working Group. Specific thanks are due to Andrea Perego, Dan Brickley, Linda van den Brink and Jeremy Tandy.
The editors would also like to thank comments received from Addison Phillips, Adriano Machado, Adriano Veloso, Andreas Kuckartz, Augusto Herrmann, Bart van Leeuwen, Christophe Gueret, Erik Wilde, Giancarlo Guizzardi, Gisele Pappa, Gregg Kellogg, Herbert Van de Sompel, Ivan Herman, Leigh Dodds, Lewis John McGibbney, Makx Dekkers, Manuel Tomas Carrasco-Benitez, Maurino Andrea, Michel Dumontier, Nandana Mihindukulasooriya, Nathalia Sautchuk Patrício, Peter Winstanley, Renato Iannella, Steven Adler, Vagner Diniz and Wagner Meira.
The editors also gratefully acknowledge the chairs of this Working Group: Deirdre Lee, Hadley Beeman, Yaso Córdova and the staff contact Phil Archer.
Changes since the previous version include: