This document is also available in these non-normative formats: XML.
Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This finding attempts to provide an answer for the question “when are two XML chunks equal?” Many applications will require their own special-purpose algorithms, but this finding provides one solution that attempts to balance utility and complexity.
This Finding has been abandoned by the TAG. The issue will be closed without further action.
Since this issue was opened, the XSL and XML Query Working Groups have published XQuery 1.0 and XPath 2.0 Functions and Operators. That document includes a definition for fn:deep-equal, which compares two chunks of XML. The TAG concludes that this definition suffices for the general case that this Finding was attempting to address.
1 Introduction
2 Infoset Equality
2.1 Infosets
2.2 Document Information Items
2.3 Element Information Items
2.4 Attribute Information Items
2.5 Processing Instruction Information Items
2.6 Unexpanded Entity Reference Information Items
2.7 Character Information Items
2.8 Comment Information Items
2.9 Document Type Declaration Information Items
2.10 Unparsed Entity Information Items
3 Customizing the Comparison
A References
B Examples (Non-Normative)
This finding attempts to provide an answer for the question “when are two XML chunks equal?” Taken narrowly, a chunk of XML is an information item (a document, element, attribute, etc.). Taken broadly, it is a set or sequence of information items (a set of documents, a sequence of elements, a heterogeneous sequence of items, etc.).
Many applications will require their own special-purpose algorithms, but this finding provides one general solution that attempts to balance utility and complexity.
Different applications can have very different notions of what constitutes identity or equality:
A digital signature application may need a canonical, bit-for-bit identical lexical representation of both the data and the markup.
A language runtime system may need to know that two variables refer to the same data structure in memory.
A semantic inference application may need to know that two representations have the same URI.
A message passing application may need to know whether two distinct messages are “the same,” that is, whether they are structurally equivalent.
It is this last class of equality that this finding attempts to address: given two distinct XML structures, can we decide whether they convey “the same information”?
We describe this equality in terms of the [XML Information Set (Second Edition)].
We define chunk equality in terms of the [XML Information Set (Second Edition)]. Similar definitions could be built on top of the [XML Schema Part 1: Structures] Post-Schema Validation Infoset (PSVI), the [XML Path Language (XPath) Version 1.0] data model, or the [XQuery 1.0 and XPath 2.0 Data Model]. We choose the infoset because it is a common abstraction for XML specifications.
A few general notes about how the comparisons are performed:
Information items of different types (elements and attributes or comments and processing instructions) are never the same.
Ordered lists (such as the [children] property) are compared pairwise and in order. In other words, two ordered lists "A" and "B" are the same if and only if the first item of "A" is the same as the first item of "B", the second item of "A" is the same as the second item of "B", etc. It follows that they can only be the same if they are the same length.
Unordered lists (such as the [attributes] property) are compared pairwise but without respect to order. In other words, two unordered lists "A" and "B" are the same if and only if there exists a set of pairs of items, one from each list, such that the two items in each pair are equal and every item from "A" and "B" appears in exactly one pair. It follows that they can only be the same if they are the same length.
XML Base. If the infosets being compared were constructed by an application that claims conformance to the XML Base recommendation, then the xml:base attribute is excluded from attribute comparisons.
In this specification, the base URI is not considered significant.
Natural Language. The xml:lang attribute is not treated specially in the Infoset but is intended to have a scoped effect much like the base URI. This intention is made explicit in this specification.
If the infosets being compared were constructed by an application that provides application semantics for xml:lang, then the application must be able to determine whether or not two elements or attributes have the same language.
If the infosets being compared were constructed by an application that does not provide special semantics for xml:lang, then two elements or attributes have the same language if they have the same inherited value for xml:lang.
The inherited value for xml:lang is the value of xml:lang on the element in question or the value from the closest ancestor. In XPath terms: (ancestor-or-self::*/@xml:lang)[last()]
Languages are compared case insensitively.
XML Space. This specification does not extend any special status to the xml:space attribute, nor does it treat whitespace marked as [element content whitespace] in any special way.
When two information items are compared:
Properties with the value "no value" are equal.
Properties with the value "unknown" are not equal.
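The list and property rules above can be sketched in Python as follows. This is a minimal illustration, not part of the finding: the names `NO_VALUE`, `UNKNOWN`, `property_equal`, `ordered_equal`, and `unordered_equal` are hypothetical, and the greedy matching in `unordered_equal` assumes the supplied comparison is symmetric and transitive. Treating "unknown" as unequal to every value, not only to another "unknown", is also an assumption.

```python
# Hypothetical sentinels for the two special infoset property values.
NO_VALUE = object()
UNKNOWN = object()


def property_equal(p, q):
    """Compare two property values; "unknown" never equals anything."""
    if p is UNKNOWN or q is UNKNOWN:
        return False
    # NO_VALUE compares equal to itself by identity.
    return p == q


def ordered_equal(a, b, eq=property_equal):
    """Pairwise, in-order comparison; the lists must have the same length."""
    return len(a) == len(b) and all(eq(x, y) for x, y in zip(a, b))


def unordered_equal(a, b, eq=property_equal):
    """Pair every item of `a` with a distinct, equal item of `b`."""
    if len(a) != len(b):
        return False
    remaining = list(b)
    for x in a:
        for i, y in enumerate(remaining):
            if eq(x, y):
                del remaining[i]  # each item may appear in only one pair
                break
        else:
            return False  # no unused partner for x
    return True
```

For example, `unordered_equal([1, 2], [2, 1])` holds while `ordered_equal([1, 2], [2, 1])` does not.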
Two infosets are equal if and only if their root information items are equal.
The comparison explicitly ignores the XML version. The XML version has an impact on infoset construction (with respect to line-feed normalization, for example), but it is not necessary to consider it in infoset comparison. Element and attribute names and element content are the same if they consist of the same characters, regardless of how those characters were encoded.
Two document information items are equal if the following properties are equal:
[children]
[document element]
[all declarations processed]
Two element information items are equal if they have the same language and the following properties are equal:
[namespace name]
[local name]
[children]
[attributes], exclusive of xml:lang
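As a rough illustration of the element rule, the following Python sketch compares two elements parsed with the standard library's `xml.etree.ElementTree`. It is an approximation under stated assumptions, not the finding's algorithm: ElementTree is not an infoset implementation, its default parser drops comments and processing instructions, it does not expose [attribute type], and xml:base is not handled. The function name `elements_equal` is hypothetical, and an element with no xml:lang in scope is assumed to have the empty string as its language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under its expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def elements_equal(a, b, lang_a="", lang_b=""):
    """Compare two elements; lang_a/lang_b carry the inherited xml:lang."""
    # Inherited language: own xml:lang if present, else the ancestor's.
    lang_a = a.get(XML_LANG, lang_a).lower()
    lang_b = b.get(XML_LANG, lang_b).lower()
    if lang_a != lang_b:
        return False
    # The tag is "{namespace name}local name", so prefixes do not matter.
    if a.tag != b.tag:
        return False
    # [attributes], exclusive of xml:lang; dict equality ignores order.
    attrs_a = {k: v for k, v in a.attrib.items() if k != XML_LANG}
    attrs_b = {k: v for k, v in b.attrib.items() if k != XML_LANG}
    if attrs_a != attrs_b:
        return False
    # [children]: leading text, then each child followed by its tail text.
    if (a.text or "") != (b.text or "") or len(a) != len(b):
        return False
    return all(
        (ca.tail or "") == (cb.tail or "")
        and elements_equal(ca, cb, lang_a, lang_b)
        for ca, cb in zip(a, b)
    )
```

Passing the inherited language down the recursion stands in for the XPath expression (ancestor-or-self::*/@xml:lang)[last()], since ElementTree elements do not keep a reference to their parent.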
Two attribute information items are equal if they have the same language and the following properties are equal:
[namespace name]
[local name]
[normalized value]
[attribute type]
Two processing instruction information items are equal if the following properties are equal:
[target]
[content]
Two unexpanded entity reference information items are equal if the following properties are equal:
[name]
[system identifier]
[public identifier]
Two character information items are equal if the following properties are equal:
[character code]
[element content whitespace]
Two comment information items are equal if the following properties are equal:
[content]
The algorithm described in 2 Infoset Equality is very conservative. It could be made more permissive with the addition of a few parameters. For example, parameters could adjust the algorithm to do any or all of the following:
Ignore processing instructions.
Ignore comments.
Ignore the document type declaration.
Even if the algorithm remains conservative, applications can influence the results by choosing how the infoset is constructed. There is no single, normative way to construct an infoset.
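One way such parameters might work is sketched below in Python. The sketch flattens each tree into a list of items (one per character for text, mirroring character information items) and lets two hypothetical flags, `ignore_comments` and `ignore_pis`, drop items before the lists are compared in order. The function names are illustrative assumptions; the code requires Python 3.8 or later for the `insert_comments` and `insert_pis` options of `TreeBuilder`, and it does not model the document type declaration because ElementTree does not expose it.

```python
import xml.etree.ElementTree as ET


def parse_keeping_all(text):
    """Parse XML, keeping comments and processing instructions in the tree."""
    builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
    return ET.fromstring(text, parser=ET.XMLParser(target=builder))


def items(node, ignore_comments=False, ignore_pis=False):
    """Flatten a tree into an ordered list of comparable items."""
    out = []
    if node.tag is ET.Comment:
        if not ignore_comments:
            out.append(("comment", node.text))
    elif node.tag is ET.ProcessingInstruction:
        if not ignore_pis:
            out.append(("pi", node.text))
    else:
        # Sorting the attributes makes their order insignificant.
        out.append(("start", node.tag, sorted(node.attrib.items())))
        out.extend(node.text or "")  # one item per character
        for child in node:
            out.extend(items(child, ignore_comments, ignore_pis))
            # Text after a child is kept even when the child is ignored.
            out.extend(child.tail or "")
        out.append(("end", node.tag))
    return out


a = parse_keeping_all("<e>a<!--note-->b<?pi data?><f/></e>")
b = parse_keeping_all("<e>ab<f/></e>")
strict = items(a) == items(b)
relaxed = items(a, ignore_comments=True, ignore_pis=True) == items(b)
```

Here `strict` is false and `relaxed` is true: once the comment and the processing instruction are ignored, the characters on either side of them become adjacent, as they would be in the second document.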
This appendix provides a few examples to help to clarify what we mean by “the same information.” Unless otherwise stated, there are no unshown, in-scope namespace bindings in any of these examples.
<element-one/> attr="value"
These information items are different: they are not the same kind of information item.
<element-one/> <element-two/>
These elements are different: they have different [local name]s.
<element xmlns="http://example.org/ns-one"/> <element xmlns="http://example.org/ns-two"/>
These elements are different: they have different [namespace name]s.
<element attr1="value1"/> <element attr1="value1" attr2="value2"/>
These elements are different: they have different [attributes].
<element attr1="value1"/> <element attr1="a different value"/>
These elements are different: they have different attribute values.
<element attr1="value1" attr2='value2'/> <element attr2="value2" attr1="value1"/>
These elements are the same: attribute quoting and order are insignificant in the infoset.
<element xmlns="http://example.org/ns"/> <x:element xmlns:x="http://example.org/ns"/>
These elements are the same: namespace prefix bindings on element and attribute names are not significant in the infoset.
<x:element xmlns:x="http://example.org/ns" attr="x:name"/> <y:element xmlns:y="http://example.org/ns" attr="y:name"/>
These elements are different: they have different attribute values. The infoset does not have attribute values of type “QName”, so it is not possible to determine if the attribute in this case actually contains a QName or if it just contains different characters. This specification compares the characters.
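The two prefix examples above can be checked with a short Python snippet using the standard library's `xml.etree.ElementTree`, which reports element names as "{namespace name}local name". This is only an illustration with a general-purpose parser, not an infoset implementation; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<x:element xmlns:x="http://example.org/ns" attr="x:name"/>')
b = ET.fromstring('<y:element xmlns:y="http://example.org/ns" attr="y:name"/>')

# The expanded names match even though the prefixes differ.
same_name = a.tag == b.tag

# The attribute values are plain character strings, so they differ.
same_attr = a.get("attr") == b.get("attr")

print(same_name, same_attr)
```

The element names compare equal while the attribute values do not, which is the distinction the last example draws.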
<element>Montréal</element> <element>Montréal</element>
These elements are the same: encoding differences are not significant.
<element xml:lang="us-EN"> <element>Some content.</element> </element> <element xml:lang="us-EN"> <element xml:lang="us-en">Some content.</element> </element>
These elements are the same: all of the elements have the same content and are in the same language.
<element xsi:type="xs:double">3.0</element> <element xsi:type="xs:double">3</element>
These elements are different. The comparison is based on the infoset, not on properties of the PSVI, even if the content might be the same under some other interpretations.
<element> <element2>Some content.</element2> </element> <element><element2>Some content.</element2></element>
These elements are different. In the first case, the element has three children; in the second, it has only one.
<element>Some content.</element> <element>Some content.</element>
These elements are different. New line characters are not normalized in element content.
<element attr="one two">Some content.</element> <element attr="one two">Some content.</element>
These elements are the same. New line characters and whitespace are normalized in attribute content.
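The difference between element content and attribute content can be observed with a conforming parser. The Python sketch below uses the standard library's `xml.etree.ElementTree`; the position of the newline in each string is an assumption made for illustration, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A newline inside element content is preserved by the parser.
content_a = ET.fromstring("<element>Some\ncontent.</element>")
content_b = ET.fromstring("<element>Some content.</element>")
same_content = content_a.text == content_b.text

# A newline inside an attribute value is normalized to a space.
attr_a = ET.fromstring('<element attr="one\ntwo">Some content.</element>')
attr_b = ET.fromstring('<element attr="one two">Some content.</element>')
same_attr = attr_a.get("attr") == attr_b.get("attr")

print(same_content, same_attr)
```

The element content differs because the newline survives parsing, while the attribute values compare equal after attribute-value normalization.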