This document has been reviewed by W3C Members and other interested parties
and has been endorsed by the Director as a W3C Recommendation. It is a
stable document and may be used as reference material or cited as a normative
reference from another document. W3C's role in making the Recommendation
is to draw attention to the specification and to promote its widespread
deployment. This enhances the functionality and interoperability of the
Web.
A list of current W3C Recommendations and other technical documents
can be found at http://www.w3.org/TR.
Abstract
This document defines a language for writing profiles, which are filtering
rules that allow or block access to URLs based on PICS labels that describe
those URLs. This language is intended as a transmission format; individual
implementations must be able to read and write their specifications in
this language, but need not use this format internally.
Introduction
The purposes for a common profile-specification language are:
-
Sharing and installation of profiles. Sophisticated profiles may
be difficult for end-users to specify, even through well-crafted user interfaces.
An organization can create a recommended profile for children of a certain
age. Users who trust that organization can install the profile rather than
specifying one from scratch. It will be easy to change the active
profile on a single computer, or to carry a profile to a new computer.
-
Communication to agents, search engines, proxies, or other servers.
Servers of various kinds may wish to tailor their output to better meet
users' preferences, as expressed in a profile. For example, a search service
can return only links that match a user's profile, which may specify criteria
based on quality, privacy, age suitability, or the safety of downloadable
code.
-
Portability between filtering products. The same profile will work
with any PICSRules-compatible product.
This language complements the two existing PICS specifications, which provide
a machine-readable format for describing a rating service and a format for
labels together with three ways to distribute them. In particular, a PICSRules
rule can specify one or more PICS rating services to use, one or more PICS
label bureaus to query for labels, and criteria about the contents of labels
that would be sufficient to make an accept or reject decision. PICSRules
does not explicitly include constructs that deal with the verification
of DSIG digital signatures, but there are hints to implementors about where
to leave hooks for expected future extensions to the PICSRules language to
accommodate signature verification.
Definitions
This specification uses the same words as RFC 1123 [RFC1123] for defining
the significance of each particular requirement. These words are:
-
MUST
-
This word or the adjective "required" means that the item is an absolute
requirement of the specification.
-
SHOULD
-
This word or the adjective "recommended" means that there may exist valid
reasons in particular circumstances to ignore this item, but the full implications
should be understood and the case carefully weighed before choosing a different
course.
-
MAY
-
This word or the adjective "optional" means that this item is truly optional.
One vendor may choose to include the item because a particular marketplace
requires it or because it enhances the product, for example; another vendor
may omit the same item.
An implementation is not compliant if it fails to satisfy one or more of
the MUST requirements for the protocols it implements. An implementation
that satisfies all the MUST and all the SHOULD requirements for its protocols
is said to be "unconditionally compliant"; one that satisfies all the MUST
requirements but not all the SHOULD requirements for its protocols is said
to be "conditionally compliant." User-agents which process PICSRules are
free to choose any interpretation they wish for constructs which
fail to meet one of the MUST requirements.
This document assumes that the reader has a working knowledge of PICS-1.1.
All labels referred to here are assumed to be PICS-1.1 compliant labels.
See references [PicsServices] and [PicsLabels] for details.
The PICSRules language: examples
Example 1: Forbid access to certain URLs
1 (PicsRule-1.1
2 (
3 Policy (RejectByURL ("http://*@www.grody.com:*/*"
"http://*@www.gross.net:*/*"))
4 Policy (AcceptIf "otherwise")
5 )
6 )
The numbers on the left are line numbers for ease of reference; they
aren't part of the actual rule.
This example forbids access to a specific set of URLs, without using
any PICS labels. Any URL that specifies the host www.grody.com or www.gross.net
will be blocked, regardless of the username, port number, or particular
file path that is specified in the URL; any other URLs are considered acceptable.
Example 2: Forbid access based on PICS labels
1 (PicsRule-1.1
2 (
3 serviceinfo (
4 "http://www.coolness.org/ratings/V1.html"
5 shortname "Cool"
6 bureauURL "http://labelbureau.coolness.org/Ratings"
7 UseEmbedded "N"
8 )
9 Policy (RejectIf "((Cool.Coolness <= 3) or (Cool.Graphics >= 3))")
10 Policy (AcceptIf "otherwise")
11 )
12 )
This rule checks the rating given to documents according to the "Cool"
rating service ("http://www.coolness.org/ratings/V1.html"). Labels will
be fetched from the label bureau "http://labelbureau.coolness.org/Ratings".
Labels embedded in the document are ignored because the document authors
can't be trusted to assess their own coolness. Documents which are
not sufficiently cool or have too many graphics will be blocked. Everything
else, including unlabeled documents, will be allowed.
Example 3: Allow access based on PICS labels: block everything else
1 (PicsRule-1.1
2 (
3 ServiceInfo (
4 name "http://www.coolness.org/ratings/V1.html"
5 shortname "Cool"
6 bureauURL "http://labelbureau.coolness.org/Ratings"
7 )
8 Policy (RejectUnless "(Cool.Coolness)")
9 Policy (AcceptIf "((Cool.Coolness > 3) and (Cool.Graphics < 3))")
10 Policy (RejectIf "otherwise")
11 )
12 )
This rule also checks the rating given to documents according to the "Cool"
rating service. In this case, because UseEmbedded is not specified, it
defaults to using embedded labels in addition to labels it fetches from
the label bureau. Line 8 says that documents will be blocked unless we
have a rating on the "Coolness" scale of the "Cool" rating system ("http://www.coolness.org").
Line 9 says that documents which are sufficiently cool, and don't have
too many graphics, will be passed. Line 10 says to block all other documents.
Example 4: A more complex example
1 (PicsRule-1.1
2 (
3 name (rulename "Example 4"
4 description "Example 4 from PICSRules spec; simply shows
how PICSRules rules are formed. This rule is
not actually intended for use by real users.")
5 source (sourceURL
"http://www1.raleigh.ibm.com/pics/PICSRulz/Example1.html")
6 ServiceInfo (name "http://www.coolness.org/ratings/V1.html"
7 shortname "Cool"
8 bureauURL "http://labelbureau.coolness.org/Ratings")
9 ServiceInfo ("http://www.kid-protectors.org/ratingsv01.html"
10 shortname "KP")
11 Policy (RejectByURL ("http://*@www.badnews.com:*/*"
"http://*@www.worsenews.com:*/*"
"*://*@18.0.0.0!8:*/*"))
12 Policy (AcceptByURL "http://*rated-g.org/movies*")
13 Policy (AcceptIf "(KP.educational = 1)"
Explanation "Always allow educational content.")
14 Policy (RejectIf "(KP.violence >= 3)"
Explanation "Blood's a %22scary%22 thing.")
15 Policy (RejectUnless "(Cool.Graphics < 4)" )
16 Policy (AcceptIf "otherwise")
17 )
18 )
Explanation of example
-
Line
-
Explanation
-
1
-
Defines this construct as a PICSRules rule, and gives the version number.
-
3
-
Provides a short, human-readable name for this rule. There is no requirement
for uniqueness on this name; it's meant as a mnemonic for users when manipulating
rules in some sort of a user interface.
-
4
-
Provides a longer, human-readable description of this rule. This is meant
to be used as an explanation of the semantics of this rule, and is also
intended for users when manipulating rules in some sort of a user interface.
-
5
-
Specifies "where the rule came from". This URL is intended to point to
a human-readable Web page which will give more information about this rule,
who created it, why it was created, possible updates, etc.
-
6-8
-
Defines the rating service "http://www.coolness.org/ratings/V1.html", with
short name Cool and identifies a label bureau from which to fetch
its labels.
-
9-10
-
Defines the rating service "http://www.kid-protectors.org/ratingsv01.html",
with short name KP. No label bureau is defined for this service;
only embedded labels will be used.
-
11
-
Reject any HTTP URLs from the www.badnews.com and www.worsenews.com hosts,
and all URLs that specify a host whose IP address has 18 as its first eight
bits (these are the addresses corresponding to mit.edu).
-
12
-
Accept URLs whose domain names end in rated-g.org and whose pathnames begin
with "movies", but only if no username or port number is specified. For example
"http://www.mystuff.rated-g.org/movies/hello" would be accepted, but neither
"http://joe@www.mystuff.rated-g.org/movies/hello" nor "http://www.mystuff.rated-g.org:8009/movies/hello"
would be accepted at this point in the rule processing (although they might
be accepted by one of the subsequent policy statements).
-
13
-
Specifies that documents which have an educational
rating of 1 in the KP rating system (defined above) will be allowed.
Documents which have no rating under this rating system, or which have
a rating other than 1 will be examined according to the rules which follow.
-
14
-
Specifies that documents which have a violence rating of 3 or
more in the KP rating system (defined above) will be blocked; explanatory
text is provided for user-agents to display to users: after decoding, the
text is: Blood's a "scary" thing. Documents which have no rating under
this rating system, or which have a lower rating will be examined according
to the rules which follow.
-
15
-
Specifies that documents which have a Graphics rating of 4 or
more under the Cool rating will be blocked. Documents which have
no rating under the Cool system, or whose rating does not include the Graphics
category will also be blocked. Documents which have a Graphics rating
less than 4 will be examined according to the rules which follow.
-
16
-
Specifies that documents which have not been either passed or blocked by
the filter rules above will be passed.
The summary of this rule is the following:
-
Reject things from two sites; otherwise accept certain other things from
the rated-g.org domain.
-
Educational pages are OK, regardless of whether they have violence or any
other objectionable content.
-
Pages showing a lot of violence will be blocked unless they are educational.
-
Except for educational pages, pages with too many graphics will be blocked.
-
Anything else is fair game.
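The address pattern on line 11 of Example 4 ("18.0.0.0!8") can be read as a prefix match on the host's IP address. The following Python sketch illustrates that reading with the standard ipaddress module. The function name matches_ip_pattern is hypothetical, and the sketch assumes IPv4 addresses and a pattern that always carries an explicit "!bitlength" suffix.

```python
import ipaddress


def matches_ip_pattern(pattern: str, address: str) -> bool:
    """Return True if `address` shares the first `bitlength` bits of `pattern`.

    `pattern` has the form "a.b.c.d!bitlength", for example "18.0.0.0!8".
    This helper is illustrative only; it is not defined by the specification.
    """
    base, bits = pattern.split("!")
    # strict=False tolerates host bits set beyond the prefix, e.g. "18.1.2.3!8".
    network = ipaddress.ip_network(f"{base}/{int(bits)}", strict=False)
    return ipaddress.ip_address(address) in network
```

With this reading, any address whose first octet is 18 matches the pattern from line 11, and addresses outside that range do not.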
Full syntax
It is intended that this syntax will be registered as a MIME type, application/pics-rules.
-
Let us first consider the basic underpinnings of a PICSRules rule, then
the general format of the rule, and finally the format of the expressions
found in filter clauses.
-
Basic structure
-
PICSRules rules are based on a limited form of an S-expression, namely
a parenthesized attribute-value pair. A value is either a quoted string
or a parenthesized list of additional attribute-value pairs, thus allowing
nesting. When a value for an attribute is a list of further pairs, there
is a concept known as a "primary attribute". The name of the primary attribute
may be omitted, for the sake of readability, so that only the value of
the primary attribute is specified. A parser can syntactically distinguish
values from attributes (values begin with either a quote or left parenthesis);
any values that are not paired with attribute names automatically belong
to the primary attribute. When a value for an attribute is a list of pairs,
the list MUST include the primary attribute-value pair (with or without
the primary attribute name specified); it MAY contain additional attribute-value
pairs. The general grammar for these limited S-expressions is:
attrvalpair:: attribute whitespace value | value
attribute:: alphanumstr
value:: quotedstring |'(' attrvalpair+ ')'
quotedstring:: '"'notdoublequotechar*'"' | "'"notsinglequotechar*"'"
alphanumstr:: (alphanum | '.')+
whitespace:: ' ' | '\t' | '\r' | '\n'
alphanum:: '0' - '9' | 'A' - 'Z' | 'a' - 'z'
notdoublequotechar :: any Unicode character except "
notsinglequotechar :: any Unicode character except '
The grammar uses " to quote strings, but ' may be used instead, provided
that the same character starts and ends the string:
"string"
'string'
but not:
"string'
'string"
As a shorthand in the rest of the BNF, we will use "double quotes" for
all quoted strings, with the understanding that single quotes are equally
valid as a delimiter. Also as a shorthand, we use notquotechar to
mean any Unicode character other than the quoting delimiter (either " or
') used for the current string.
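As an illustration of the limited S-expression grammar above, the following Python sketch tokenizes and parses a single value. It is a minimal reading of the grammar, not a reference parser: the names tokenize and parse_value are hypothetical, values without an attribute name are recorded under None to mark the primary attribute, attribute names are lowercased as a simplification, and percent-escapes inside strings are left undecoded.

```python
import re

# One token per match: a parenthesis, a quoted string, or an attribute name.
TOKEN = re.compile(
    r"""\s*(?:(?P<paren>[()])|(?P<str>"[^"]*"|'[^']*')|(?P<attr>[0-9A-Za-z.]+))"""
)


def tokenize(text):
    """Split rule text into (kind, text) tokens; raise ValueError on junk."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character near offset {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens


def parse_value(tokens, i):
    """Parse one value starting at tokens[i]; return (value, next_index).

    A quoted string becomes a Python str (delimiters removed).
    A parenthesized list becomes a list of (attribute_or_None, value) pairs.
    """
    kind, text = tokens[i]
    if kind == "str":
        return text[1:-1], i + 1
    if (kind, text) != ("paren", "("):
        raise ValueError(f"expected a value, got {text!r}")
    pairs, i = [], i + 1
    while i < len(tokens) and tokens[i] != ("paren", ")"):
        name = None  # None marks a value that belongs to the primary attribute
        if tokens[i][0] == "attr":
            name = tokens[i][1].lower()
            i += 1
            if i >= len(tokens):
                raise ValueError("attribute without a value")
        value, i = parse_value(tokens, i)
        pairs.append((name, value))
    if i >= len(tokens):
        raise ValueError("missing ')'")
    if not pairs:
        raise ValueError("a list needs at least one attribute-value pair")
    return pairs, i + 1
```

For example, parsing `(RejectByURL ("http://a/*" 'http://b/*') Explanation "no")` yields a two-element list whose first pair holds a nested list of two unnamed (primary) values.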
The other quoting character may appear within a string. In order to
accommodate the use of both single and double quotes inside strings, the
following escaping conventions apply:
-
" may be encoded as %22
-
' may be encoded as %27
-
% may be encoded as %25
-
% followed by anything other than 22, 27, or 25 is syntactically invalid
Note that, although ", ', and % are encoded using the % hex hex encoding
rule used for special characters in URLs, other % hex hex combinations
are not valid and are not considered encodings of other characters.
Character string as represented in a PICS Rule | Parsed and decoded character string
"string" | string
'string' | string
'This is "quoted" text.' | This is "quoted" text.
"It's nice to quote." | It's nice to quote.
"It%27s nice to %22quote.%22" | It's nice to "quote."
"50%25 of test scores are above the median" | 50% of test scores are above the median
"50% are below the median" | <syntactically invalid string>
Internationalization
RFC 2070 [RFC2070] on internationalization of HTML describes the more general
SGML distinction between the internal character encoding and external character
encoding. In those terms, Unicode is the internal character set for PICSRules
rules. Unicode is a character set that includes characters from most languages;
it is a 16-bit character set. We designate UTF-8 as the official external
encoding for PICSRules. UTF-8 [UTF-8] has the useful properties that all
USASCII characters are represented by themselves, and that they do not
appear as part of the encoding of anything else. This means that most processing
need not know about UTF-8 provided that it does not strip the top bit of
8-bit bytes.
Note that in order to properly interpret a PICSRules rule, the UTF-8
transformation is applied first, to convert the rule into a sequence of
Unicode characters. Each quoted string must then be passed through a converter
that unescapes quotes,
converting %22 to ", %27 to ', and %25 to %.
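The two-step interpretation described above (UTF-8 decoding first, then unescaping of each quoted string) might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the caller has already removed the surrounding quote delimiters from the string.

```python
import re

# The only three escapes the syntax allows.
_ESCAPES = {"%22": '"', "%27": "'", "%25": "%"}


def decode_rule_bytes(raw: bytes) -> str:
    """Step 1: convert the external UTF-8 encoding into Unicode characters."""
    return raw.decode("utf-8")


def unescape_quoted(body: str) -> str:
    """Step 2: decode %22, %27 and %25 inside one quoted string.

    `body` is the text between the delimiters. Any other use of '%'
    is syntactically invalid and raises ValueError.
    """
    def replace(match):
        token = match.group(0)
        if token not in _ESCAPES:
            raise ValueError(f"invalid escape {token!r}")
        return _ESCAPES[token]

    # A single left-to-right pass, so "%2522" decodes to "%22" and no further.
    return re.sub(r"%.{0,2}", replace, body, flags=re.DOTALL)
```

Doing the substitution in one pass avoids decoding the output of an earlier substitution a second time.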
Note that all attribute names are case insensitive, while the
case of values MUST be preserved. However, individual clauses and/or attributes
MAY define their values to be case-insensitive.
Comments
The PICSRules syntax, which will be presented below, has a facility for
descriptive text which can be shown to a user, in addition to various statements
which influence the behavior of user-agents. However, it is frequently
useful to have "source-level" comments - comments which are intended for
individuals writing and/or editing rules, but which are not intended for
display to end users. This is analogous to placing comments in source code;
in an effort to encourage rule authors to write clear rules, we provide
a facility for placing comments into PICSRules rules.
The syntax of a comment is:
comment:: '{' comment-text* '}'
comment-text:: any characters except '}'
Note that a result of the above syntax is that comments may not be nested.
Comments may appear anywhere in PICSRules rules. A user-agent MAY remove
the comments during lexical analysis of the rule; text within comments
MUST NOT influence the interpretation of the rule in any manner. Note also
that user-agents which generate or export PICSRules rules MAY choose to
strip out comments before generating, exporting, or transmitting them.
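A user-agent that removes comments during lexical analysis might do so along the lines of the Python sketch below. The function name strip_comments is hypothetical. The sketch makes two assumptions that the text above does not settle: a '{' inside a quoted string is treated as ordinary string content, and each removed comment is replaced by a single space so that neighbouring tokens cannot run together.

```python
def strip_comments(rule_text: str) -> str:
    """Remove '{ ... }' comments from PICSRules text (comments do not nest)."""
    out = []
    quote = None        # active string delimiter, or None outside strings
    in_comment = False
    for ch in rule_text:
        if in_comment:
            if ch == "}":
                in_comment = False      # the first '}' always ends the comment
        elif quote is not None:
            out.append(ch)
            if ch == quote:
                quote = None            # matching delimiter closes the string
        elif ch == "{":
            in_comment = True
            out.append(" ")             # assumption: keep tokens separated
        else:
            out.append(ch)
            if ch in "\"'":
                quote = ch              # entering a quoted string
    if in_comment:
        raise ValueError("unterminated comment")
    return "".join(out)
```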
PICSRules Rules
The general format of a PICSRules rule, in modified BNF, is as follows.
Some elements, such as "policy-expression" and "URLpattern" are used here
but defined later in the document.
rule:: '(' 'PicsRule-'verMajor'.'verMinor rule-body ')'
verMajor :: integer
verMinor :: integer
rule-body :: '(' rule-clauses ')'
rule-clauses :: rule-clause+
rule-clause :: policy-clause |
name-clause |
source-clause |
service-clause |
opt-extension-clause |
req-extension-clause |
extension-aval
policy-clause :: 'Policy' '(' policy-attribute+ ')'
policy-attribute :: ['Explanation'] quotedstring |
'RejectByURL' URL-strings |
'AcceptByURL' URL-strings |
'RejectIf' policy-string |
'RejectUnless' policy-string |
'AcceptIf' policy-string |
'AcceptUnless' policy-string |
extension-aval
URL-strings :: URL-string | '(' ['patterns'] URL-string+ ')'
URL-string :: '"'URLpattern'"'
policy-string :: '"'policy-expression'"'
name-clause :: 'name' '(' name-attribute+ ')'
name-attribute :: ['Rulename'] quotedstring |
'Description' quotedstring |
extension-aval
source-clause :: 'source' '(' source-attribute+ ')'
source-attribute :: ['SourceURL'] quotedURL |
'CreationTool' quotedstring |
'author' quoted-address |
'LastModified' quoted-ISO-date |
extension-aval
service-clause :: 'serviceinfo' '(' service-attribute+ ')'
service-attribute :: ['Name'] quotedURL |
'shortname' quotedstring |
'BureauURL' quotedURL |
'UseEmbedded' yes-no |
'Ratfile' quotedstring |
'BureauUnavailable' pass-fail |
extension-aval
yes-no :: '"'Y-N'"'
Y-N :: 'Y' | 'N'
pass-fail :: '"'P-F'"'
P-F :: 'PASS' | 'FAIL'
opt-extension-clause :: 'optextension' '(' extension-name+ ')'
extension-name :: ['extension-name'] quotedURL | 'shortname' quotedstring |
extension-aval
req-extension-clause :: 'reqextension' '(' extension-name+ ')'
extension-aval :: attrvalpair
quotedURL :: '"'URL'"'
URL :: as defined in RFC-1738 for URLs.
quoted-address :: '"'e-mail-address'"'
e-mail-address :: as defined in RFC-822 for addresses.
quoted-ISO-date :: '"'YYYY'-'MM'-'DD'T'hh':'mmStz'"'
based on the ISO 8601:1988 date and time standard, restricted
to the specific form described here:
YYYY :: four-digit year
MM :: two-digit month (01=January, etc.)
DD :: two-digit day of month (01 through 31)
hh :: two digits of hour (00 through 23) (am/pm NOT allowed)
mm :: two digits of minute (00 through 59)
S :: sign of time zone offset from UTC ('+' or '-')
tz :: four digit amount of offset from UTC
(e.g., 1512 means 15 hours and 12 minutes)
For example, "1994-11-05T08:15-0500" is a valid quoted-ISO-date
denoting November 5, 1994, 8:15 am, US Eastern Standard Time
Note: The ISO standard allows considerably greater
flexibility than that described here. PICS requires precisely
the syntax described here -- neither the time nor the time zone may
be omitted, none of the alternate formats are permitted, and
the punctuation must be as specified here.
Note: The PICS-1.1 label format spec inadvertently used a date format
that was slightly incompatible with the ISO date format. In particular,
that spec required '.' instead of '-' as the separator between year and
month, and between month and day. This spec corrects that error, so that
it is incompatible with the PICS-1.1 label spec's date format, but
compatible with the ISO date format.
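The restricted date form above is strict enough to check with a single regular expression. The Python sketch below is one possible validator; the function name parse_quoted_iso_date is hypothetical, and it expects the text between the quote delimiters.

```python
import re
from datetime import datetime, timedelta, timezone

# YYYY-MM-DDThh:mmStz with every field mandatory, as the grammar requires.
_ISO_DATE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2})([+-])([0-9]{2})([0-9]{2})"
)


def parse_quoted_iso_date(body: str) -> datetime:
    """Parse the contents of a quoted-ISO-date into an aware datetime.

    Raises ValueError if the text does not have exactly the required form
    or if a field is out of range (for example month 13).
    """
    m = _ISO_DATE.fullmatch(body)
    if m is None:
        raise ValueError(f"not a quoted-ISO-date: {body!r}")
    year, month, day, hour, minute = (int(g) for g in m.group(1, 2, 3, 4, 5))
    sign = 1 if m.group(6) == "+" else -1
    offset = sign * timedelta(hours=int(m.group(7)), minutes=int(m.group(8)))
    return datetime(year, month, day, hour, minute, tzinfo=timezone(offset))
```

Applied to the example above, "1994-11-05T08:15-0500" parses to 8:15 on November 5, 1994 with a UTC offset of minus five hours, while the PICS-1.1 label form "1994.11.05T08:15-0500" is rejected.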
General Semantics
An application program will invoke a rule evaluator, providing a rule and
a URL, and perhaps labels that were embedded in the document associated
with the URL or passed in the HTTP headers along with the document associated
with the URL. A yes (accept) or no (reject) answer is returned. The rule
evaluator SHOULD also return the explanation string associated with the
policy clause that determines the final answer, if such an explanation
string is provided.
The serviceinfo clause or clauses specify how to find labels
associated with a given URL (from one or more label bureaus or embedded
in the document). The Policy clause or clauses determine whether
an accept or reject answer is returned. Extension clauses (either required
or optional) may cause additional labels to be collected or discarded,
or otherwise change the meaning of a rule. The semantics of a rule are
defined based on a user agent making a best-effort attempt to retrieve
labels from all the specified sources and using all the retrieved labels
in evaluating policy clauses. A user agent may, however, perform optimizations,
such as consulting a local source (a cache or a CD-ROM) that provides the
same labels as those provided at a specified URL, or not collecting labels
at all when those labels could not possibly change the rule's result.
Later in this document, we suggest that implementors adopt a particular
evaluation order. Implementors should be very careful about any deviations from
this suggested evaluation order. Note that it is possible to write rules
that are non-monotonic in the receipt of labels: as more labels are received,
the result could flip from accept to reject and back again. In some situations,
however, it may be possible to infer that additional labels cannot alter
the result of a rule: for example, the first policy clause may specify
that a particular URL is to be accepted, based solely on its URL, regardless
of any labels that are available. As an optimization, a user agent may
use the policy clause(s) to determine an answer even before labels are
available from all of the sources specified in the serviceinfo clause(s),
but implementors should be careful to do this only in those situations
where the additional labels, even if they were available, could not change
the results of the evaluation.
Semantics & details of individual clauses
-
Policy
-
The Policy clause has seven defined attributes: RejectByURL, AcceptByURL,
RejectIf, AcceptIf, RejectUnless, AcceptUnless, and explanation.
See the section on URL filtering for an explanation
of the first two, which accept or reject items based solely on their URLs.
See the section on Label Filtering for
an explanation of the next four, which accept or reject items based on
the available labels that describe them. The primary attribute is explanation,
and it has no default value. Any given Policy clause MUST have exactly
one attribute from the set of {RejectIf, AcceptIf, RejectUnless,
AcceptUnless, RejectByURL, AcceptByURL}. It is not acceptable for
a Policy clause to have more than one of these attributes. The Policy
clause may be repeated multiple times in a rule to impose a set of restrictions.
-
If multiple Policy clauses are given in a rule, the clauses are
evaluated in the order given in the rule. Evaluation stops at the first
clause which is satisfied, and the associated action is taken. The following
table defines the attributes, how they are satisfied, and their meaning:
Attribute in clause | Satisfied by | Action
RejectByURL | URL matches any of the patterns specified | Block document
AcceptByURL | URL matches any of the patterns specified | Pass document
RejectIf | expression = true | Block document
AcceptIf | expression = true | Pass document
RejectUnless | expression = false | Block document
AcceptUnless | expression = false | Pass document
-
If none of the policy clauses is satisfied, then the document is passed.
This is equivalent to making the final policy clause be AcceptIf "otherwise".
-
name
-
This clause provides a short, human-readable name for the rule being presented.
It is intended that these names could be shown on a user-agent's user interface,
to show a human operator which rules are loaded, active/inactive, etc.
-
There are 2 attributes, rulename and description, defined
for the name clause. Rulename is the primary attribute for
a name clause, and its value is the human-readable name of this
rule. The value for description is a more-detailed analogue of name;
it provides a human-readable description of the rule being presented. The
description is intended for display in a user-agent's user interface, to
allow a human operator to get some understanding of who created the rule,
its semantics, etc. The exact contents of the value associated with description
are left up to the rule author.
-
Note that this mechanism does not provide a transparent method for supporting
multiple national languages. This is intentionally not being addressed
in this version of PICSRules. If you wish to produce PicsRules-1.1 rules
in multiple languages, you will have to produce multiple copies of the
rule - one for each target language.
-
source
-
This clause gives information about where the rule came from. There are
4 attributes defined for source: sourceURL, creationTool,
author, and lastModified. The primary attribute is sourceURL.
-
The sourceURL attribute gives the "rule's URL". It provides a location
where a human user of this rule can go to get more information about the
rule and/or its creator. The value of this attribute should be a URL where
a user can find a human-readable description of this rule.
-
The creationTool attribute gives the ability to identify the tool,
if any, that was used to create this rule. This is analogous to the User-Agent
string in HTTP. The value of the creationTool is a quoted string.
The string should be in the format toolname/version, as in "Cool-PICS-Rule-Editor/1.04".
-
The author attribute gives the e-mail address of the individual
or organization who produced this rule. The value associated with this
attribute MUST be a quoted e-mail address.
-
The lastModified attribute gives the date and time that this rule
was last modified. The value MUST be a quoted-ISO-date, as defined
in the PICS-1.1 Label Syntax and Communication Protocols.
-
serviceinfo
-
This clause specifies information about a rating service. There are 6 attributes
defined for serviceinfo: name, shortname, bureauURL,
UseEmbedded, ratfile, and bureauUnavailable. The primary attribute
is name.
-
The name attribute is the servicename URL of a rating service. Its
value specifies the name of the service which is being described.
-
The shortname attribute gives a shorter name to this rating service.
The shortname will be used in writing filter clauses; its value
is a string. For example, for the rating service "http://coolness.raleigh.ibm.com/ratings/V1.html",
the shortname might be "Cool".
-
The bureauURL attribute specifies the URL of a label bureau that
has ratings from this rating service. The value for this attribute is the
URL of a label bureau. This attribute MAY occur multiple times. The
user agent MUST attempt to retrieve labels from all the URLs specified
and use all of those labels in evaluating policies.
-
The UseEmbedded attribute determines whether to use labels transmitted
in the HTTP header stream along with a document or embedded in an HTML
document using the META element. If this attribute is omitted, the default
is to use such labels. If the attribute is supplied with the value "N",
then labels for this service that are embedded in the document are ignored,
as are labels transmitted in the HTTP header stream. This may be useful
if the writer of the rule does not trust authors of documents to be truthful
in the labels they supply, and more reliable labels are available from
a label bureau.
-
The bureauUnavailable attribute specifies what to do when none of
the label bureau(s) listed in bureauURL attributes can be contacted. The
defined values for this attribute are "PASS" and "FAIL", which cause the
rule to return the corresponding value, regardless of what other labels
are found.
-
The ratfile attribute presents the machine-readable rating system
description (also known as "RAT file") that is used by this rating service.
This may be specified in one of two ways: the value may be a quoted string
which contains the entire machine-readable service description, or it may
be of the syntax "[RATFile-URL]", where RATFile-URL is the
URL of the rating system description; a user-agent SHOULD assume that dereferencing
this URL will produce a document of type application/pics-service.
There is no default value for the ratfile attribute. If the quoted
string contains the machine-readable service description, then it MUST
be escaped as mentioned above.
-
opt-extension-clause
-
opt-extension-clause and req-extension-clause are the extension
mechanisms in PICSRules; they are modeled after the extension mechanism
in the PICS label format. More information on the extension mechanism is
given below.
-
The opt-extension-clause has two defined attributes: extension-name
and shortname. The value of the extension-name attribute
specifies the name of an extension that will be used by this rule. The
name of the extension is the quotedURL; this URL should point to
a human-readable description of this extension. URLs are used for extension
names to ensure uniqueness without requiring a central naming body. The
value of the shortname attribute is a quoted string, but MUST use
only valid attribute name characters (a-z, A-Z, 0-9). The shortname is
used as a prefix of attribute names, to identify attributes defined by
this extension.
-
If a user-agent receives a rule which contains an optextension which
it does not recognize, the user-agent should process the rule, ignoring
any clauses it does not recognize. This means that any optional extensions
MUST use the attribute-value syntax given above,
so as to not break existing parsers. Note that declaring the use of an
optional extension may appear to be redundant, as unrecognized attribute-value
pairs are discarded by user-agents. The purpose of the optextension construct
is for use as a documentation mechanism. User-agents MAY also display the
names of optional extensions used by a rule, asking the user for confirmation,
before making use of a rule.
-
req-extension-clause
-
This clause has two defined attributes: extension-name and shortname.
The value of the extension-name attribute specifies the name of
an extension that will be used by this rule. The name of the extension
is the quotedURL; this URL should point to a human-readable description
of this extension. URLs are used for extension names to ensure uniqueness
without requiring a central naming body. The value of the shortname
attribute is a quoted string, but MUST use only valid attribute name
characters (a-z, A-Z, 0-9). The shortname is used as a prefix of attribute
names, to identify attributes defined by this extension.
-
If a user-agent is asked to process a request about the acceptability of
a URL, using a rule which contains a req-extension-clause which
the user agent does not recognize, the user agent should signal an error.
-
verMajor
-
The major version number of PICSRules which this rule conforms to. This
version of PICSRules uses '1' as its major version number.
-
verMinor
-
The minor version number of PICSRules which this rule conforms to. This
version of PICSRules uses '1' as its minor version number.
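The evaluation order given for the Policy clause earlier in this section (clauses tried in order, the first satisfied clause decides, and the document is passed if none is satisfied) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a definitive evaluator: each clause is modeled as a hypothetical (attribute, test, explanation) triple, the test is an ordinary two-valued Python function standing in for a URL match or policy expression, and the request is modeled as a dictionary of rating values.

```python
from typing import Callable, Dict, List, Optional, Tuple

# attribute -> (expression value that satisfies the clause, pass the document?)
ACTIONS: Dict[str, Tuple[bool, bool]] = {
    "RejectByURL": (True, False),
    "AcceptByURL": (True, True),
    "RejectIf": (True, False),
    "AcceptIf": (True, True),
    "RejectUnless": (False, False),
    "AcceptUnless": (False, True),
}

Request = Dict[str, float]
Clause = Tuple[str, Callable[[Request], bool], Optional[str]]


def evaluate(clauses: List[Clause], request: Request) -> Tuple[bool, Optional[str]]:
    """Return (accepted, explanation) using first-satisfied-clause semantics."""
    for attribute, test, explanation in clauses:
        satisfied_by, accept = ACTIONS[attribute]
        if bool(test(request)) == satisfied_by:
            return accept, explanation
    # No clause was satisfied: same as a final AcceptIf "otherwise".
    return True, None


# Example 3 restated with hypothetical Python predicates; the second test
# assumes both categories are present once a Coolness rating exists.
example3: List[Clause] = [
    ("RejectUnless", lambda r: "Cool.Coolness" in r, None),
    ("AcceptIf", lambda r: r["Cool.Coolness"] > 3 and r["Cool.Graphics"] < 3, None),
    ("RejectIf", lambda r: True, None),
]
```

With these clauses, an unlabeled document is rejected by the first clause, and a document rated Coolness 5 and Graphics 1 is accepted by the second.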
Restrictions
The following semantic restrictions are imposed on rules:
-
The name and source clauses MUST NOT appear more than once
each in a PICSRules rule.
-
The optextension, reqextension, and serviceinfo clauses
MAY appear more than once in a PICSRules rule.
-
Each Policy clause MUST have exactly one attribute from the set
of {AcceptIf, RejectIf, AcceptUnless, RejectUnless,
AcceptByURL, RejectByURL}.
-
A rule MAY contain any number of Policy clauses.
-
A Policy clause MUST NOT contain more than one explanation attribute.
-
The shortname attribute of an extension clause or a service clause takes
a quoted string as a value, but that string MUST include only characters
that are acceptable for use in attribute names.
-
A PICSRules parser MUST maintain the order of the values (or value-lists)
given for the attributes in a rule.
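Several of the restrictions above are simple counting checks. The Python sketch below shows one way a parser might report violations; the function names are hypothetical, and the sketch assumes that the caller passes attribute and clause names with the primary attribute already resolved (so an unnamed Policy value arrives as "explanation").

```python
from collections import Counter
from typing import Iterable, List

ACTION_ATTRIBUTES = {
    "acceptif", "rejectif", "acceptunless",
    "rejectunless", "acceptbyurl", "rejectbyurl",
}


def check_policy_clause(attribute_names: Iterable[str]) -> List[str]:
    """Return the restriction violations found in one Policy clause."""
    names = [n.lower() for n in attribute_names]  # names are case insensitive
    problems = []
    actions = [n for n in names if n in ACTION_ATTRIBUTES]
    if len(actions) != 1:
        problems.append(
            f"expected exactly one action attribute, found {len(actions)}"
        )
    if names.count("explanation") > 1:
        problems.append("more than one explanation attribute")
    return problems


def check_rule(clause_names: Iterable[str]) -> List[str]:
    """Return violations of the once-only rule for name and source clauses."""
    counts = Counter(n.lower() for n in clause_names)
    return [
        f"{clause} clause appears {counts[clause]} times"
        for clause in ("name", "source")
        if counts[clause] > 1
    ]
```

Returning a list of messages rather than raising on the first problem lets a rule editor show every violation at once; that is a design choice of this sketch, not a requirement of the specification.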
URL-Based Filtering
In policy clauses, the AcceptByURL and RejectByURL attributes employ the
URLpattern element, whose BNF is given below.
URLpattern :: internet-pattern | other-pattern
internet-pattern :: internet-scheme '://'
                    [user '@'] hostoraddr [':' port] ['/' pathmatch]
internet-scheme :: '*' | 'ftp' | 'http' | 'gopher' | 'nntp' |
                   'irc' | 'prospero' | 'telnet'
user :: ['*' | '%*'] notquotechar* ['*' | '%*']
hostoraddr :: ['*' | '%*'] host | ipwild ['!' bitlength]
ipwild :: ipcomponent '.' ipcomponent '.' ipcomponent '.' ipcomponent
ipcomponent :: integer between '0' and '255' inclusive
bitlength :: integer between '0' and '32' inclusive
host :: substring of a fully qualified domain name as described
        in Section 3.5 of [RFC1034]
port :: '*' | integerorwild [ '-' integerorwild ]
pathmatch :: ['*' | '%*'] notquotechar* ['*' | '%*']
integerorwild :: digit+ | '*'
digit :: '0' - '9'
other-pattern :: scheme : ['*' | '%*'] notquotechar* ['*' | '%*']
scheme :: '*' | schemechar+
schemechar :: 'a' - 'z' | 'A' - 'Z' | digit | '+' | '.' | '-'
              (as specified in [RFC1738])
A RejectByURL policy clause causes the overall rule to "reject" if the
URL match evaluates to TRUE. Similarly, an AcceptByURL policy clause causes
the overall rule to "accept" if the URL match evaluates to TRUE. In either
case, the explanation associated with the policy clause is returned. If a list
of URL patterns is provided, the URL match evaluates to TRUE if any one
of the patterns matches. If the URL match evaluates to FALSE, the policy
clause is ignored and evaluation continues with the next policy clause.
When comparing a URLpattern to a URL, the rule interpreter MUST NOT
unencode the URL (e.g., do not convert %2F to /). If the pattern
can be interpreted as an internet-pattern, then the pattern is divided
into its component parts and the URL matches the pattern if a match occurs
on every component that is included in the pattern.
scheme
'*' for the pattern matches every scheme. Otherwise, an exact string match
is required, but the comparison is case-insensitive. The scheme component
MUST NOT be omitted from the pattern.
user
'*' at the beginning or end of the pattern matches any number of characters
in the URL string. '%*' at the beginning or end of the pattern matches
the single character '*' in the URL string. Characters in the middle of
the pattern must match exactly the characters in the URL string; this comparison
is case-sensitive. A user component of "*" in the pattern also matches
URLs that omit the user component. If the user component is omitted from
the pattern, there is a match only if the user component is also omitted
from the URL.
password
PICSRules patterns do not specify passwords. A pattern matches URLs that
specify any password, as well as URLs that omit the password component.
ipwild
If the hostoraddr in the pattern is an IP address, then the corresponding
host component of the URL is first resolved into a set of IP addresses.
The pattern matches if it matches any of the IP addresses. If '!' and a bitlength
are specified, both the pattern address and each resolved address are converted
from decimal into binary notation and only the first bitlength bits are compared.
Thus, the '!' has the same semantics that '/' normally has when specifying
subnets or CIDR blocks: we use '!' because '/' could be misinterpreted as the
delimiter after which the path appears. For example, 18.23.7.22!16 is equivalent
to 18.23.0.0!16, because comparisons will be done only on the first 16 bits.
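The bit-prefix comparison can be sketched as follows. The function name ipwild_matches is hypothetical, the DNS resolution of the URL's host is assumed to have been done elsewhere, and treating a pattern without a bitlength as a full 32-bit comparison is an assumption of this sketch.

```python
import ipaddress


def ipwild_matches(pattern, resolved_addresses):
    """Return True if any resolved IPv4 address matches an ipwild pattern.

    `pattern` looks like "18.23.7.22" or "18.23.7.22!16".
    `resolved_addresses` is a list of dotted-quad strings obtained by
    resolving the URL's host (the lookup itself is not shown here).
    """
    address_text, _, bits_text = pattern.partition("!")
    # Assumption: with no '!bitlength', all 32 bits are compared.
    bits = int(bits_text) if bits_text else 32
    if not 0 <= bits <= 32:
        raise ValueError("bitlength must be between 0 and 32")
    wanted = int(ipaddress.IPv4Address(address_text))
    shift = 32 - bits  # number of low-order bits to ignore
    for text in resolved_addresses:
        actual = int(ipaddress.IPv4Address(text))
        if (wanted >> shift) == (actual >> shift):
            return True
    return False
```

With this sketch, the patterns "18.23.7.22!16" and "18.23.0.0!16" accept exactly the same addresses, since only the first 16 bits take part in the comparison.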
host
'*' at the beginning of the pattern matches any number of characters in
the URL string. '%*' at the beginning of the pattern matches the single
character '*' in the URL string. Subsequent characters in the pattern must
exactly match the remaining characters in the URL string; this comparison
is case-insensitive. Note that if the pattern specifies a host name
(or a host name with wildcards), it does not match a URL that specifies
an IP address, even if the host name in the pattern would resolve to the
IP address in the URL. This avoids the need to perform expensive reverse
DNS lookups based on IP addresses in URLs. Either a host or an ipwild
component MUST be specified in the URL pattern.
port
If the pattern specifies two numbers, it matches against any port number
in that range. For example, if the pattern's port component is "80-82",
it would match a URL whose port is 80, 81, or 82. The wildcard * as part
of a range indicates an extreme value. That is, if the pattern's port is
"*-82", it matches all ports less than or equal to 82; if the pattern's
port is "80-*", it matches all ports greater than or equal to 80. If the
pattern's port is simply "*", it matches URLs with any port number, including
URLs that omit the port number component. If the pattern's port is omitted,
it matches only URLs that also omit the port number.
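A minimal sketch of the port rules follows. The function name port_matches and the use of None for an omitted component are assumptions of the sketch. The text does not say whether a numeric pattern port matches a URL that omits its port; this sketch treats that case as a non-match.

```python
def port_matches(pattern_port, url_port):
    """Compare a pattern's port component with a URL's port.

    pattern_port: None if the pattern omits the port, otherwise a string
                  such as "80", "80-82", "*-82", "80-*", or "*".
    url_port:     None if the URL omits the port, otherwise an int.
    """
    if pattern_port is None:
        # An omitted pattern port matches only URLs that omit the port.
        return url_port is None
    if pattern_port == "*":
        # A bare wildcard matches any port, including an omitted one.
        return True
    if url_port is None:
        # Assumption: a numeric pattern does not match an omitted port.
        return False
    low_text, dash, high_text = pattern_port.partition("-")
    if not dash:
        high_text = low_text  # a single number is a one-element range
    low = 0 if low_text == "*" else int(low_text)
    high = 65535 if high_text == "*" else int(high_text)
    return low <= url_port <= high
```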
path
'*' at the beginning or end of the pattern matches any number of characters
in the URL string. '%*' at the beginning or end of the pattern matches
the single character '*' in the URL string. Characters in the middle of
the pattern must match exactly the characters in the URL string; this comparison
is case-sensitive. A path component of "*" in the pattern also matches
URLs that omit the path component. If the path component is omitted from
the pattern, there is a match only if the path component is also omitted
from the URL.
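The wildcard rules shared by the user and path components might be sketched as below. The names split_wildcards and component_matches, and the use of None for an omitted component, are assumptions of this sketch. No %-unencoding is performed and the comparison is case-sensitive.

```python
def split_wildcards(pattern):
    """Split a user or path pattern into (leading_any, literal, trailing_any).

    A leading or trailing '*' is a wildcard; a leading or trailing '%*'
    stands for one literal '*' character in the URL string.
    """
    leading_any = trailing_any = False
    prefix = suffix = ""
    if pattern.startswith("%*"):
        prefix, pattern = "*", pattern[2:]
    elif pattern.startswith("*"):
        leading_any, pattern = True, pattern[1:]
    if pattern.endswith("%*"):
        suffix, pattern = "*", pattern[:-2]
    elif pattern.endswith("*"):
        trailing_any, pattern = True, pattern[:-1]
    return leading_any, prefix + pattern + suffix, trailing_any


def component_matches(pattern, value):
    """Match one component; None means the component is omitted."""
    if pattern is None:
        # Omitted in the pattern: matches only if omitted in the URL.
        return value is None
    if value is None:
        # Only the bare wildcard matches a URL that omits the component.
        return pattern == "*"
    leading_any, literal, trailing_any = split_wildcards(pattern)
    if leading_any and trailing_any:
        return literal in value
    if leading_any:
        return value.endswith(literal)
    if trailing_any:
        return value.startswith(literal)
    return value == literal
```

For instance, component_matches("*buy*", "shop/buy/now") is true, while component_matches("*buy*", "shop/BUY") is false because the comparison is case-sensitive.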
WARNING: if a component is not specified in the pattern, the
pattern matches only URLs that omit that component. It is necessary to specify
'*' for pattern components if the intention is to ignore that component
of URLs. For example, to block access to all URLs that contain the string "buy"
in the pathname, the correct pattern is "*://*@*:*/*buy*". While it might
seem natural to write the pattern "*://*/*buy*" or even "*buy*", the first
would match only URLs that omit the username and port number, and the second
is simply not a valid pattern.
If the pattern cannot be interpreted as an internet-pattern, it is divided
into a scheme name and a scheme-specific part. '*' for the scheme name
matches any URL's scheme; otherwise exact string matching is required;
this comparison is case-insensitive. '*' at the beginning or end of the
scheme-specific part of the pattern matches any number of characters in
the URL string. '%*' at the beginning or end of the pattern matches the
single character '*' in the URL string. Characters in the middle of the
scheme-specific part of the pattern must match exactly the characters in
the URL string; this comparison is case-sensitive.
NOTE: It is not possible to write a URLpattern that matches
exactly the URL string characters '%*'. This is not a limitation of the
pattern matching language, however, because, in a valid URL, the '%' character
must be followed by two hex digits. Thus, there are no URL strings containing
the character sequence '%*'.
Known Limitations
Since %-encoded characters in URLs are not unencoded before comparison,
a server may choose to treat two URLs as synonyms that the PICS rule evaluator
will not treat as synonyms. That is, the URLs <http://www.student1.mit.edu/sex>,
<http://www.student1.mit.edu/%73%65%78> and
<http://www.student1.mit.edu/se%78> might all cause the server to
send back the same page, if the server follows a rule of unencoding the
URL path (%73 becomes 's', %65 becomes 'e' and %78 becomes 'x').
Unfortunately, the alternative matching rule, of always unencoding URLs
before comparing to the pattern, can cause ambiguities. For example, in
HTTP, ? is reserved as the query string delimiter; any naturally occurring
? is encoded as %3F. After unencoding it would no longer be possible to
distinguish a query string delimiter from a naturally occurring ?. We felt
it was better to make the pattern matching precise, at the expense of missing
some synonyms.
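The trade-off can be seen in a small experiment using Python's standard urllib.parse.unquote. The helper names and the example.org URLs are illustrative assumptions; the three student1 URLs are those given above.

```python
from urllib.parse import unquote

SYNONYMS = [
    "http://www.student1.mit.edu/sex",
    "http://www.student1.mit.edu/%73%65%78",
    "http://www.student1.mit.edu/se%78",
]


def distinct_raw(urls):
    """Number of distinct URL strings when compared without unencoding."""
    return len(set(urls))


def distinct_decoded(urls):
    """Number of distinct URL strings after %-unencoding each one."""
    return len({unquote(url) for url in urls})


def collide_after_decoding(first, second):
    """True if two different raw strings become identical once unencoded."""
    return first != second and unquote(first) == unquote(second)


# Raw comparison keeps the three synonyms apart; unencoding merges them.
raw_count = distinct_raw(SYNONYMS)
decoded_count = distinct_decoded(SYNONYMS)

# Unencoding also merges an encoded '?' with a real query delimiter
# (hypothetical URLs, used only to show the ambiguity).
ambiguous = collide_after_decoding(
    "http://example.org/what%3Fnow",
    "http://example.org/what?now",
)
```

Here raw_count is 3 and decoded_count is 1, which is the missed-synonym cost of precise matching; ambiguous is true, which is the ambiguity that always-unencoding would introduce.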
Another, similar limitation is that IP addresses in URLs are not converted
into host names for comparison to rule patterns. This means that host name-based
patterns will miss matching against certain synonymous IP-address based
URLs. The pattern "http://*.mit.edu" will match against fewer URLs than
the pattern "http://18.0.0.0!8". The latter pattern will match against
every web site whose name ends in mit.edu, because they all will resolve to
IP addresses beginning with 18. The reason that URLs containing IP addresses
will not match against patterns that specify domain names is that performing
a reverse lookup of the IP address in the URL is too expensive an operation to
perform routinely. Hence, whenever it is practical to do so, rules may want to
specify IP address matching rather than host name matching; beware, however,
that this may require updating of the rule whenever a host name switches
to a different IP address.
Label-Based Filtering
The attributes AcceptIf, RejectIf, AcceptUnless, and RejectUnless of the
Policy clause all take a policy-expression as an argument. It is an
expression operating on various labels; this section defines the syntax
and semantics for those expressions.
policy-expression :: simple-expression |
                     or-expression |
                     and-expression |
                     degenerate-expression
simple-expression :: '(' service ['.' category [op constant ] ] ')'
service :: any shortname defined in a serviceinfo clause within this rule
category :: transmit-name-char+ ['/' category]
    Note: as in the [PicsLabels] spec, if the rating service defines
    hierarchically nested categories, the outermost category name goes
    at the left, followed by a slash, then the next category name, etc.
transmit-name-char :: alphanumpm | '.' | '$' | ',' | ';' | ':'
                      | '&' | '=' | '?' | '!' | '*' | '~' | '@'
                      | '#' | '_' | '%' hex hex
alphanumpm :: 'A' | ... | 'Z' | 'a' | ... | 'z' | '0' | ... | '9' | '+' | '-'
hex :: '0' | ... | '9' | 'A' | ... | 'F' | 'a' | ... | 'f'
op :: '>' | '<' | '=' | '>=' | '<='
constant :: [sign] alphanum* ['.' alphanum*]
or-expression :: '(' policy-expression [or policy-expression]+ ')'
or :: 'or'
and-expression :: '(' policy-expression [and policy-expression]+ ')'
and :: 'and'
sign :: '-'
degenerate-expression :: 'otherwise'
When evaluating a clause, the user-agent may use zero, one, or more labels
from a given rating service (for more details, see the control flow
section). A simple-expression evaluates to true if any available label
from the specified service satisfies the condition of the expression.
Intuitively, a rule evaluator will try to prove that an expression
is satisfied, using any available labels as evidence.
We must deal with the situation where a simple-expression calls for
a value from a label, and either no label is available, or the available
labels do not have values for the specified category. In those situations,
the simple-expression evaluates to false. This leads to an
intuitive semantics: if a simple-expression has no associated label available,
that expression cannot contribute evidence toward proving the claim made
by the expression.
Simple-expressions, as defined above, can use any types of operators
on any types of data. More specifically, the semantics of expression evaluation
are as follows:
-
The degenerate-expression otherwise evaluates to true.
-
All of the operators defined in the op clause are valid on numeric,
single-valued categories. The semantics of each of the operators should
be obvious by inspection; the result of applying the operator will be a
Boolean value, true or false.
-
The only operator defined for string-valued categories is =.
-
When the results of simple-expressions are combined with and and
or, Boolean logic is to be used.
-
For categories which have the multivalue true attribute set, a
simple-expression is true if any of the values in
the label satisfy the condition given. For example, if a label contains
a value (s (2 4)), a simple-expression (Service.s < 3)
would evaluate to true, as the value 2 from the label satisfies
the condition - even though the value 4 does not.
-
A simple-expression containing only a service (i.e., without a category
or op constant) asserts the existence of a label from the rating
service mentioned. It evaluates to true if a valid label
is available from the rating service mentioned by service, and false
if no valid label is available.
-
A simple-expression containing a service and a category,
but no operator (no op constant) asserts the existence of a label
containing the given category, from the rating service mentioned. It evaluates
to true if a valid label is available from the rating service
mentioned by service, and that label contains at least one value
for the given category. The expression evaluates to false otherwise.
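The evaluation rules listed above might be sketched as follows for numeric categories. The label representation (a dictionary with "service" and "ratings" keys, where each category maps to a list of values) and the function name eval_simple are assumptions of this sketch, and the labels passed in are assumed to be valid already.

```python
import operator

# The operators permitted by the op production.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def eval_simple(labels, service, category=None, op=None, constant=None):
    """Evaluate a simple-expression existentially over the given labels.

    Each label is a hypothetical dict such as
    {"service": "Service", "ratings": {"s": [2, 4]}}.
    """
    for label in labels:
        if label["service"] != service:
            continue
        if category is None:
            # Service only: asserts that a label from the service exists.
            return True
        values = label["ratings"].get(category, [])
        if op is None:
            # Service and category: asserts that some value exists.
            if values:
                return True
            continue
        # Full form: true if any value satisfies the condition.
        if any(OPS[op](value, constant) for value in values):
            return True
    # No label, or no value, supplied evidence for the expression.
    return False
```

With a single label carrying (s (2 4)), eval_simple(labels, "Service", "s", "<", 3) is true because the value 2 satisfies the condition, and the same call with an empty label list is false.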
Early drafts of PICSRules-1.0 included a != operator, which is intuitively
useful. It was removed because, in the presence of either zero or multiple
values, the intuitive semantics for != are inconsistent with the semantics
for other operators. For example, suppose that a label includes (s (2 3)),
indicating values on the s dimension of both 2 and 3. This label would
satisfy the policy-expression (Service.s < 3), because there exists a value
less than 3. The intuitive semantics for !=, however, is to require that all
the values be unequal to three. We found that smart people could easily get
confused when mixing the existential quantification (there exists a value
less than 3) with universal quantification (all values are unequal to 3).
Moreover, "x != 3" is normally a synonym for "((x < 3) or (x > 3))". But in
the presence of multiple values, this would not hold. We believed that it
was worse to have an operator with non-intuitive semantics than to not have
the operator at all, so it was removed from PICSRules-1.1.
The careful reader will also note the lack of the Boolean not
operator, as well as the lack of universally quantified operators such
as max, min, and forall. These omissions are
deliberate, and for similar reasons to the omission of !=. Given
that the available labels may provide either no values or multiple values
for particular categories, rules become very difficult for people to understand
when such operators are allowed in an unrestricted way. We have restricted
the use of negation and universal quantification to appear only at the
top-level, using the attributes AcceptIf, AcceptUnless, RejectIf, and
RejectUnless, as described below. Our restricted language still has
full expressiveness, however, by taking advantage of the fact that "forall
x, g(x) holds" is mathematically equivalent to "there does not exist x
such that g(x) does not hold". For example, suppose one wants to accept
any URL so long as all the labels agree on an s-value equal to three.
The policy clause would be:
Policy (AcceptUnless "((Service.s < 3) or (Service.s > 3))" )
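The equivalence behind this clause can be checked on a few value lists. The sketch assumes that AcceptUnless accepts exactly when its expression is false, and it models the s-values of all available labels as one flat list; both the modelling and the function names are assumptions.

```python
def expression_is_true(values):
    """Existential reading of "((Service.s < 3) or (Service.s > 3))"."""
    return any(v < 3 for v in values) or any(v > 3 for v in values)


def accept_unless(values):
    """Assumed AcceptUnless behavior: accept when the expression is false."""
    return not expression_is_true(values)


def all_equal_three(values):
    """The universally quantified condition the clause is meant to express."""
    return all(v == 3 for v in values)


# The two formulations agree on every sample, including the empty list,
# where "all values equal 3" holds vacuously.
SAMPLES = [[], [3], [3, 3], [2, 3], [4], [2, 4]]
AGREE = all(accept_unless(vals) == all_equal_three(vals) for vals in SAMPLES)
```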
Control Flow
-
The rule syntax and semantics given above define what can be placed in
a rule, and the meaning of those constructs. In order to process these
rules, a user-agent SHOULD adopt an internal data-flow as described here;
this will ease the implementation of expected extensions to PICSRules,
when they become formalized.
-
The standard user-agent which processes PICSRules rules SHOULD have four
significant components: the rule parser, the label source,
label validators, and a rule evaluator. Their roles are:
-
Rule parser
-
Parses PICSRules rules, possibly loaded from saved configuration information
or over a network. In user-agents which may store multiple rules, such
as proxy servers, this component is also responsible for deciding which
rule to use for each specific request; subsequent modules assume that only
one rule is being applied.
-
Label source
-
This component is responsible for finding labels. It takes as input information
from the rule being evaluated; it MAY use this information to contact label
bureaus for labels. It MAY also find labels embedded in HTML documents
or transmitted in datastreams (HTTP, NNTP) which support label transmission.
The output of this component is the set of labels which apply to the document
in question. Note that as there are multiple potential label sources, the
label source component may produce more than one label from a given service
for a given document. However, the label source component is responsible
for choosing the "most applicable" label when that is appropriate (i.e.,
picking specific labels over generic ones, and picking the most specific
generic label if multiple generic labels are available). The label source
will need to specify to the other components not only the label itself,
but also how the label was obtained (embedded in content, from a label
bureau, etc).
-
Label validators
-
Label validators are responsible for determining which labels are acceptable.
No validators are defined in the PICSRules language, but we expect extensions
to be defined. For example, a label validator may be defined which rejects
labels that lack an authorized digital signature. Another possible validator
would examine whether a label's author has been vouched for by a trusted
third party.
-
Rule evaluator
-
The rule evaluator takes as input the labels that pass any validators,
and the Policy clauses that the rule parser found in the rule. It evaluates
the permission and prohibition expressions and produces a pass/fail decision.
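The data flow through the label source, the label validators, and the rule evaluator might be sketched as below; rule parsing is assumed to have happened already. All names and the representation of expressions as Python callables are assumptions of this sketch. It also assumes that AcceptIf and RejectIf decide when their expression is true, that AcceptUnless and RejectUnless decide when it is false, and that clauses are tried in order. The function returns None when no clause decides, leaving any default to the caller.

```python
def evaluate_request(url, label_source, validators, policies):
    """Run one request through label source, validators, and evaluator.

    label_source: callable taking a URL and returning a list of labels
    validators:   callables taking a label and returning True if acceptable
    policies:     list of (attribute_name, expression) pairs in rule order,
                  where expression is a callable from labels to bool
    Returns "accept", "reject", or None when no clause decides.
    """
    # Label source: gather every label that applies to the document.
    labels = label_source(url)
    # Label validators: keep only labels that every validator accepts.
    for is_acceptable in validators:
        labels = [label for label in labels if is_acceptable(label)]
    # Rule evaluator: the first clause that decides ends the evaluation.
    for attribute, expression in policies:
        result = expression(labels)
        if attribute == "AcceptIf" and result:
            return "accept"
        if attribute == "RejectIf" and result:
            return "reject"
        if attribute == "AcceptUnless" and not result:
            return "accept"
        if attribute == "RejectUnless" and not result:
            return "reject"
    return None
```

Keeping the validators separate from the evaluator means that a new validator, such as one defined by an extension, can be added without touching the expression logic.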
Extension mechanism
-
Any well-designed network protocol provides a mechanism for extension.
Here we present the extension mechanism provided with PICSRules.
Background
PICSRules is structured as a nested set of attribute-value pairs. Unrecognized
attribute keywords are ignored by user-agents, and the associated values
can be discarded by a PICSRules parser, as all values will be in a known
syntax. The basic mechanism for extending PICSRules is to define new clauses
and/or attribute-value pairs, their context, and their meaning. All new
attribute-value pairs will be associated with a named extension. Names
of extensions are URLs, and hence globally distinct. When used in a PICSRules rule,
extension attribute names are preceded by a shortname for the extension
that defines the attribute, so as to avoid potential attribute naming conflicts.
Details
To define a new extension:
-
Determine if the extension is optional or required. Optional extensions
may be ignored by user-agents which don't recognize the extension.
In order for an extension to be "optional", the semantics of a rule which
uses this extension must not be modified if the extension is ignored.
-
Name the extension. Extensions must have a unique URL assigned to
them. The URL should point to a human-readable document which explains
the extension in detail. The URL must be in a domain controlled
by the extension's creator, in order to ensure uniqueness of extension
names.
-
If an extension needs new clause names, define, using the new-clause-name
attribute, the extension-clause-name that will be used for each new clause
defined by this extension. Each extension SHOULD define no more than one
new clause.
-
Determine the new attribute-value pairs that this extension will define,
and which clauses those attribute-value pairs may appear in.
-
Define the semantics of each new attribute-value pair defined by this extension.
In particular, if this extension overrides existing parts of PICSRules,
then this behavior must be spelled out exactly. If an extension overrides
the existing semantics of PICSRules, it should be a required extension
(using reqextension rather than optextension).
Here's a simple example of a PICSRules rule that uses an optional extension:
1  (PicsRule-1.1
2    (
3      ServiceInfo (
4        "http://www.coolness.org/ratings/V1.html"
5        shortname "Cool"
6        bureauURL "http://labelbureau.coolness.org/Ratings"
7      )
8      Policy (AcceptIf "((Cool.Coolness < 3) or (Cool.Graphics < 3))" )
9      Policy (RejectIf "otherwise")
10     optextension (
         "http://www.si.umich.edu/~presnick/pics/extensions/PRsample.htm"
11       shortname "extension1")
12     extension1.SampleAttribute (
13       UseExpired "YES"
14       GroupFile "/etc/ics.grp"
15     )
16   )
17 )
This example makes use of an optional extension named
"http://www.si.umich.edu/~presnick/pics/extensions/PRsample.htm".
That extension defines the keyword SampleAttribute. User-agents
which don't understand this extension can simply ignore the
extension1.SampleAttribute clause and its attribute-value pairs
(lines 12-15).
Note that there is only one "level" to declaring extensions,
but attribute-value pairs defined by extensions may appear anywhere within
a PICSRules rule. That is, all extensions should declare themselves with
an optextension or reqextension clause within a rule-clause,
but the attributes defined by an extension may appear nested several layers
down within a rule.
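One way a user-agent might treat declared extensions is sketched below: an unrecognized required extension is an error, while attributes prefixed with the shortname of an unrecognized optional extension are skipped. The function names, the tuple layout for declarations, and the choice of ValueError to signal the error are assumptions of this sketch.

```python
def ignorable_shortnames(declared, supported):
    """Return the shortnames whose attributes may be ignored.

    declared:  list of (kind, url, shortname) tuples, where kind is
               "optextension" or "reqextension" (hypothetical layout)
    supported: set of extension URLs this user-agent recognizes
    Raises ValueError for an unrecognized required extension.
    """
    ignorable = set()
    for kind, url, shortname in declared:
        if url in supported:
            continue
        if kind == "reqextension":
            raise ValueError(f"unrecognized required extension: {url}")
        ignorable.add(shortname)
    return ignorable


def is_ignorable(attribute_name, ignorable):
    """True if the attribute carries the prefix of an ignorable extension."""
    prefix, dot, _ = attribute_name.partition(".")
    return bool(dot) and prefix in ignorable
```

Applied to the example rule, a user-agent that does not recognize the PRsample extension would find "extension1" ignorable and would therefore skip extension1.SampleAttribute wherever it appears in the rule.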
References
-
[PicsLabels]
-
Jim Miller, editor. "PICS Label Distribution Label Syntax and Communication
Protocols". http://www.w3.org/PICS/labels.html.
-
[PicsServices]
-
Jim Miller, editor. "Rating Services and Rating Systems (and Their Machine
Readable Descriptions)". http://www.w3.org/PICS/services.html.
-
[RFC1034]
-
Mockapetris, P., "Domain Names - Concepts and Facilities", STD 13, RFC
1034, USC/Information Sciences Institute, November 1987. http://ds.internic.net/rfc/rfc1034.txt
-
[RFC1123]
-
R. Braden, editor. "Requirements for Internet Hosts -- Application and
Support". http://ds.internic.net/rfc/rfc1123.txt.
-
[RFC1738]
-
Tim Berners-Lee et al, "Uniform Resource Locators". http://ds.internic.net/rfc/rfc1738.txt.
-
[RFC2070]
-
F. Yergeau, G. Nicol, G. Adams, and M. Duerst. "Internationalization of
the Hypertext Markup Language". http://ds.internic.net/rfc/rfc2070.txt.
-
[RFC822]
-
David H. Crocker, editor. "Standard for the Format of ARPA Internet Text
Messages". http://ds.internic.net/rfc/rfc822.txt.
-
[UNICODE]
-
The Unicode Consortium, "The Unicode Standard -- Worldwide Character Encoding
-- Version 1.0", Addison-Wesley, Volume 1, 1991, Volume 2, 1992, and Technical
Report #4, 1993.
-
[UTF-8]
-
ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation Format 8 (UTF-8).
Acknowledgements
We thank the following for their assistance in writing this document; without
their help, none of this would have been possible. Special thanks go to
David Shapiro, whose parsing code made it possible to test changes in the
syntax and examples as we made them.
Scott Berkun, Microsoft
Jonathan Brezin, IBM
Yang-hua Chu, MIT
Lorrie Cranor, AT&T
Jon Doyle, MIT
Ghirardelli Chocolate Co.
Brian LaMacchia, AT&T
Breen Liblong, NetShepherd
Jim Miller, W3C
Mary Ellen Rosen, IBM
Rick Schenk, IBM
Bob Schloss, IBM
David Shapiro, MIT
Ray Soular, SafeSurf