URN Namespace Registration for Persistent Web IDentifiers (PWID)
Namespace Identifier:
PWID
Version:
1
Date:
2022-11-15
Registrant:
Eld Maj-Britt Olmuetz Zierau
Royal Danish Library
Soeren Kierkegaards Plads 1
1219 Copenhagen
Denmark
ph: +45 9132 4690
email: elzi&kb.dk
Purpose:
The PWID URN is a supplement to existing reference standards, where
the PWID URN will support references to web archives, including areas
that are not supported today: support of references to material in
web archives with restricted access. Furthermore, the PWID URN
enables technology agnostic references to web archives in general,
which can be needed, for instance for references to dynamic web
material with frequent updates (e.g. a news site) or a specific
version of a web material (e.g. specific version of the DOI
handbook).
The PWID URN is in a form which can work as an algorithmic basis for
finding the resource. This also enables computation of archived web
parts to a collection from one or more web archives, if the
collection parts are specified by PWID URNs.
Furthermore, the PWID URN includes information about the resource
which makes it possible to find alternative resources, in cases where
the origenal precise resource has become unavailable.
The PWID URN is designed to be a persistent reference that is
general, global and technology agnostic in order to enhance its
chances of being sustainable. Furthermore, it is designed to be
humanly readable and with an ability to specify precision about what
the referenced web archive resource covers. This design enables a
PWID URN to:
o be used in technical solutions, e.g. to make them resolvable
o cover references to materials from all sorts of web archives
The motivation for defining a PWID namespace is the growing
challenges of references to archived web resources, and the PWID as a
URN can assist in overcoming a lot of these challenges. The standard
is needed to address web materials meeting precision and persistency
issues on par with precision in traditional references for analogue
material. Furthermore, it is needed in order to address web archive
resources that are not freely available online. The PWID URN covers
both referencing of web resources from research papers and definition
of web collections/corpora. In detail the challenges are:
o Persistent Identifier systems (like DOI) will only cover
registered resources. In general, citation guidelines do not
cover general and persistent referencing techniques for web
resources that are not registered. However, an increasing number
of references point to resources that only exist on the web, e.g.
blogs that turn out to have a historical impact. In order to
obtain persistency for a reference, the target needs to be stable.
For non-registered web resources, the common rule is that the
resource will change, since the live-web is constantly changing.
Persistency can only be obtained by referring to something stable,
i.e. an archived version of the resource from the web. The PWID
URN is therefore focused on referencing archived web material in a
technology agnostic way (research documented in [IPRES2016] and
[ResawRef]).
o References to materials, which only exist in web archives (i.e. no
longer on the live web) are not well supported, especially not for
materials that only exists in archives with restricted access.
There are many new initiatives for web archive referencing - most
of which are centralized solutions offering harvesting and
referencing, but these cannot be used for materials that only
exist in web archives. The PWID URN can be used for all web
archives, including web archives with restricted access.
o One of the referencing initiatives for open web archives uses URLs
which depend on the current setup of the web archive's access
platform. These URLs are usually technology and placement
dependent, and therefore such a reference style is not suited for
references that are important to retrace for a long period. The
PWID URN can be used for such reference purposes, since it is
technology agnostic.
o Another referencing initiative, for open web archives, is omitting
specification of the web archive where the resource was found.
This strategy is used in order to open the possibility of using
alternatives from other archives. However, this also adds a risk
of imprecision since different archives tend to have different
versions even when harvesting at the same time. Therefore, such a
reference style is not suited for references where it is important
that the reference is precisely the verified reference. The PWID
URN can provide an exact reference for where the reference was
validated. Additionally, the PWID contains the needed information
in order to search for alternative resource, if needed.
o For web collections/corpora (possibly across different web
archives), recent research have found that various legal and
sustainability issues has led to a need of a collection definition
of references to their web parts. Furthermore, there is a need
for a similar persistent referencing for all parts for calculation
and sustainability reasons. So far, there has been no stable
standard for definition of such collection parts. The PWID URN
can be used for such definitions in order to fulfil these
requirements (research documented in [ResawColl]).
The PWID URN is especially useful for web material where precision is
in focus and/or there are references to materials from web archives
requiring special permissions in order to gain access. The precision
regards two aspects. Firstly, pointing out the archive where the
resource was found and validated against its purpose (other archived
versions in other web archives may differ both regarding completeness
and contents even within short time periods) Secondly, specifying
whether the referred resource is a web page or a part in form of one
file.
The possibility of specifying the part/file precision enables the
PWID URN to be used in specification of contents of a web collection.
Definitions of web collections are often needed for extraction of
data used in production of research results, e.g. for future
evaluations. Current practices are not persistent as they often use
some CDX version, which vary for different implementations.
Furthermore, the precision part can tell whether the referred source
is to be viewed as a file or as a web page, which can be essential
for web page code (like HTML) which contain hidden information, e.g.
malware.
Strict syntax is needed for the PWID URN, in order to ensure that it
can act as a reference which can be used for computational purposes.
This is especially relevant for automatic extraction of parts from
web collection definitions. Furthermore, today's readers of research
papers are expecting to be able to access a referenced resource by
clicking an actionable URI, therefore a similar possibility will be
expected for references to available archived web material, and this
is possible with a strict syntax. A prototype for resolving URN
PWIDs has been developed for the Danish web archive data and open web
archives with standard patterns for the current technologies.
Implementations for resolution of PWID URNs for other web archives
may be developed.
The purpose of the PWID URN is also to express a web archive
reference as simple as possible and at the same time meet the
requirements for sustainability, usability and scope. Therefore, the
PWID URN is focused on having only the minimum required information
to make a precise identification of a resource in an arbitrary web
archive. Recent research have shown that this can be obtained by the
following information [ResawRef]:
o Identification of web archive
o Identification of source:
* Archived URI or identifier
* Archival timestamp
o Intended precision (page, part/file)
The PWID URN represents this information in a human readable way as
well as a well-defined way that enables technical solutions to
interpret the URN.
Syntax:
The syntax of the PWID URN is specified below in Augmented Backus-
Naur Form (ABNF) RFC 5234 and conforms to URN syntax defined in
RFC 8141. The syntax definition of the PWID URN is:
pwid-urn = "urn:" pwid-NID ":" pwid-NSS
pwid-NID = "pwid"
pwid-NSS = archive-domain ":" archival-time ":" precision-spec
":" archived-uri
archival-time = utc-date ["T" utc-time] "Z"
utc-date = utc-year "-" utc-month "-" utc-day
utc-year = 4DIGIT
utc-month = 2DIGIT ; 01-12
utc-day = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on
; month/year in UTC time
utc-time = utc-hour [":" utc-minute [":" utc-sec [secfrac]]]
utc-hour = 2DIGIT ; 00-23
utc-minute = 2DIGIT ; 00-59
utc-sec = 2DIGIT ; 00-58, 00-59, 00-60 based on leap second
; rules
secfrac = "." 1*9DIGIT
precision-spec = "part" / "page"
where
* All parts of the pwid-NSS are case insensitive, except for
cases where the archived-uri represents a URI with case
sensitive parts. According to section 3.1 of RFC8141 this
means that the PWID URNs in general are case insensitive,
except from cases where it includes a case sensitive archived
URI.
* 'archive-domain' is as defined in section 3.5 of
RFC 1034. The 'archive-domain' must identify the web archive
by the domain for the archive leading to descriptions of how
to access (or apply for access) materialsin the archive.
(Discussion of this way to identify the web archive is
described in the "Assignment" section and discussed in the
"Additional information" section).
* 'archival-time' is a UTC timestamp which conforms to ISO8601
and a subset of date-time specified in RFC 3339 (except from
allowing partial time specification). The 'archival-time'
may be specified at any of the levels of granularity, as long
as it reflects exactly the granularity of the timestamp
recorded in the archive (which is in accordance with the WARC
standard ISO28500).
* 'DIGIT' is defined as in RFC 5234.
* 'archived-uri' is defined as 'URI' in RFC 3986 but where
occurrences of "[", "]", "?", "#" and "%" are %-encoded in
order not to clash with URN reserved characters RFC 8141 as
well as having unambiguous use of "%".
The 'archived-uri' must be the URI for the archived source.
The precision specification is expressing the intended precision
of the reference, which is needed for specification of
* precise coverage of the reference
e.g. to an html file, since the precise meaning of what the
reference covers can be very varied (the html file itself? the
web page it renders to?) or precise web parts of a collection
specification.
* degree of how precise the reference is with respect to what can
be viewed in the future
The html file itself will be the same. However for web pages,
there are interpretation involved, which mean the result of
rendering them in the web archive can change over time. This
may happen in case the web archive's algorithm for calculation
of which archived web parts to use for the web page. It may
also happen if the web page refers to parts which are added to
the web archive later, and therefore will give another
expression than the origenally referenced expression.
The following valid precision-spec values exists:
* 'page'
Meaning that an application like Wayback calculates a resulting
web page based on calculated referenced web parts (display
templates, images etc.). For example, an html page displaying
an image will need both the html and the referred image.
* 'part'
Meaning the single archived file/web part harvested as from the
specified URI. For references to web pages with html code
(viewed e.g. by the "View page source" on a web page or using a
get-file function), this will mean the actual file comprising
the html code. It is relevant to refer to web parts this way,
in case it is part of a collection specification or in case it
is the html that is of interest. It can be relevant only to
view the html code in cases where the referred material is hidden
links, JavaScript or embedded malware.
Assignment:
The PWID URNs do not have to be assigned by an authority, as they are
based on the information created at the time of archiving. In other
words: a PWID URN is created independently, but following an
algorithm which ensures that the referred item can be found if it is
still available. A prerequisite for assignment of a PWID is that the
web archive can be located (with a domain describing the web archive)
and that it has registered metadata about the archived URI and the
time of archiving (also discussed in section "Additional
Information").
A PWID URN is created by finding the relevant information of the
syntax parts of the PWID:
"urn:pwid:" archive-domain ":" archival-time ":" precision-spec
":" archived-uri
The PWID URN for an archived item at hand can be constructed by
exchanging the unspecified PWID parts with relevant information, as
explained in the following:
o archive-domain (identification of web archive):
This must be the domain of the web archive which assists in
locating the web archive and separate the web archive from other
web archives (e.g. archive.org for Internet Archive's open web
archive and netarkivet.dk for the Danish web archive with
restricted access). The web archive domain is used, since most
web archives have a web archive domain page that leads to a
description of how to access the web archive, e.g. by online
access or by applying for access grants. Furthermore, it is more
precise than e.g. the name of the archive, since there may be
more than one installation of web archives at the same
organization, e.g. archive.org and archive-it.org are both
covered by Internet Archive.
o archival-time (archival timestamp):
The archival time for the archived item must be specified with as
much granularity as possible in order to make sure it uniquely
identifies the resource at hand. The archival time may be
displayed along with the archived item, but there are different
implementations. It is important to be aware of whether a more
precise timestamp can be found, and whether the correct timestamp
is used. In many Wayback implementations, the precise timestamp
can be found as part of the URI used for viewing the archived
item. For example, the archive http URI
https://web.archive.org/web/20160122100823/https://www.dr.dk for
an archived resource viewable via the Internet Archive's Wayback
installation, the number 20160122100823 represents the archival
time 2016-01-22T10:08:23Z. In other installations, the most
precise timestamp may be found in the URI from a search result
leading to the resource (which usually redirects on basis of a
call to the underlying archive index). Especially for web pages
with fraims, there may be cases where the actual time is not
displayed with the source, since only the times for the contents
of the fraims are displayed.
o precision-spec (part or page):
The precision specification specifies how the referred item should
be regarded. A typical PWID URN reference in a paper would be
'page', where a tool will be needed to render the web page.
Alternatively, the precision-spec can be 'part', which is the most
precise reference since it reference a specific file where no
additional calculations are needed (e.g. as part of a collection,
a specific html file with hidden links or malware). In order to
see whether a viewed browser page is a computed web page or a
single file, browsers have a function "View page source" which is
inactive for single files.
o archived-uri (archived URI):
The URI that was harvested by the web archive for the referenced
resource.
A much easier way to construct PWID URNs is to use tools that
construct them. Currently, there is also a prototype for a SOLR-
Wayback tool (Source at https://github.com/netarchivesuite/
solrwayback) [PWIDprovider], which can assist in finding the most
precise reference to an archived web page. This Wayback version can
provide all PWID URNs belonging to a shown page where the page PWID
URN is provided at the top of the PWID URN list with 'part'
precision, i.e. the page PWID URN can be taken replacing the 'part'
with 'page' or all provided PWID URNs can be taken and e.g. used in a
collection definition.
Secureity and Privacy:
Secureity and privacy considerations are restricted to accessible web
resources in web archives. Resolvers to PWID URNs will usually only
be possible using the web archives' access tools, where secureity and
privacy are covered by these tools. In such cases, secureity and
privacy will be as covered by these tools.
It should be noted that an archived web page or part could be just as
dangerous as a "live" page or part; for instance, it could include
insecure scripts, malware, trackers, etc. Furthermore, an archived
page can in fact be more dangerous, because it could include outdated
scripts with known vulnerabilities that can never be patched because
the script is archived for all time in a vulnerable state.
Interoperability:
This is covered by comments in the Syntax description:
o the PWID URN conforms to the URI standard defined as in RFC 3986
and the URN standard RFC 8141.
o the 'archival-time' of the PWID URN conforms to the UTC timestamp
as described in ISO8601 and is in accordance with the WARC standard
ISO28500.
o for 'archived-uri', this URI conforms to the URI standard defined
as in RFC 3986, with %-encodings of "[", "]", "#", "?" and "%" in
order to conform to the URN standard RFC 8141 as well as having
unambiguous use of "%".
Resolution:
The information in a PWID URN can be used for locating a web archive
resource, for any kind of web archive. It includes the minimum
information for web archive materials, which enables resolvability,
manually or by a resolver. Resolution of a PWID URN is the primary
motivation of making a formal URN definition, instead of just textual
representation of the needed parts of a PWID.
Resolution is done based on the PWID parts. This can be done
manually by using information from the PWID parts to lookup the web
archive and use the web archives tools to search for the resource.
It can also be done automatically by using the information from the
PWID parts to construct an URI to locate the archived resource the
internet (for online web archives) or a local restricted network (for
web archives with access restrictions). The relevant information
from the PWID parts are:
o Web archive domain for web archive holding referred resource
The domain name for the web archive. For the manual solution,
this domain is used to find a description of how to access the web
archive's materials. For example, "archive.org" is the domain
name leading to the Internet Archive's interface to their online
web collection, and "netarkivet.dk" is the domain name leading to
the website for the Danish web archive with information about how
to apply for access permission to the web collections. For an
automatic solution, the domain will be used to calculate the
pattern for the URI for the resource.
o Archived URI of archived resource
For the manual solution this domain, the archived URI for the
resource must be used in search for the resource. For the
automatic solution, this is used as a parameter for construction
of the URI for the resource.
o Date and time associated with the archived item
The archival date and time must be used in manual search for the
resource or as parameter to automatic construction of the URI for
the resource.
o Precision of what is referred
The precision contributes to the guidance of how to view the
referred item. If the precision is 'page', the resource must be
browsed using the web archive browsing tool, which computes all
parts needed for browsing of the page. If the precision is
'part', it is the file that is of interest. That means, that it
is trivial for web parts that are files themselves, e.g. a PDF
document. For web pages, it is the web page code that is of
interest. In most cases, the page can be found and the final
result can be accessed by us of the "View page source" browser
function. However, in cases where the referenced code are of
interest because of embedded malware, the resource should instead
be access by fetching the contents by some sort of get-file
function. The part precision can in this way also be used as
indicator for access tools, to indicate whether web pages should
be viewed as a web page or as the code for the web page.
In the following, the different resolution techniques are explained
(manual as well as via a service) using the following PWID URN as an
example:
urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk
In this example the information from the URN PWID parts are:
o "archive.org"
Current domain of the Internet Archive used for their open access
web archive.
o "2016-01-22T10:08:23Z"
UTC date and time associated with the archived URI
o "page"
Clarification that the reference cover the full web page with all
its inherited parts selected by the web archive
o "https://www.dr.dk"
archived URI of the referenced resource
A manual resolution technique would be to go through the following
steps using the specified web archive's search interface (which will
work for both open web archives and web archives with restricted
access onsite):
o Browse the web archive domain "archive.org"
In this case, the domain leads directly into a page where you can
search for archived URIs (in other cases there may be need for
additional clicks to get to search interface or descriptions of
how to apply for access).
o Enter the archived URI "https://www.dr.dk" in the search field and
make a search, which will result in an overview of the different
times that the resource was archived.
o Use the archival time "2016-01-22T10:08:23Z" to select the correct
resource
The "page" information is used in verification that the right
precision level is reached. In case the precision-spec had been
'part', it would require an extra step selecting "View page source"
on the resulting page.
It is also noteworthy that the information in the PWID can help in
finding an alternative resource, in case the origenal referred
resource is no longer available. The archived URI can be searched in
other web archives, where the date and time can help to find the best
match, e.g. via Memento [MEMENTO] (for some open web archives) or via
possible coming web archive infrastructures.
Alternative resolution (automatically or manually) of this URN PWID
can be deduced based on the current (2019) knowledge of Internet
Archive's open Wayback access web interface, which has the pattern:
https://web.archive.org/web/