XML Chunk Equality

1 Introduction

This finding attempts to answer the question “when are two XML chunks equal?” Taken narrowly, a chunk of XML is an information item (a document, element, attribute, etc.). Taken broadly, it is a set or sequence of information items (a set of documents, a sequence of elements, a heterogeneous sequence of items, etc.).

Many applications will require their own special-purpose algorithms, but this finding provides one general solution that attempts to balance utility and complexity.

Different applications can have very different notions of what constitutes identity or equality:

A digital signature application may need a canonical, bit-for-bit identical lexical representation of both the data and the markup.
A language runtime system may need to know that two variables refer to the same data structure in memory.
A semantic inference application may need to know that two representations have the same URI.
A message passing application may need to know if two distinct messages are “the same,” that is, if they are structurally equivalent.

It is this last class of equality that this finding attempts to address: given two distinct XML structures, can we decide whether they convey “the same information”?

We describe this equality in terms of the [XML Information Set (Second Edition)].

2 Infoset Equality

We define chunk equality in terms of the [XML Information Set (Second Edition)]. Similar definitions could be built on top of the [XML Schema Part 1: Structures] Post-Schema Validation Infoset (PSVI), the [XML Path Language (XPath) Version 1.0] data model, or the [XQuery 1.0 and XPath 2.0 Data Model]. We choose the infoset because it is a common abstraction for XML specifications.

A few general notes about how the comparisons are performed:

Information items of different types (for example, an element and an attribute, or a comment and a processing instruction) are never the same.
Ordered lists (such as the [children] property) are compared pairwise and in order. In other words, two ordered lists "A" and "B" are the same if and only if the first item of "A" is the same as the first item of "B", the second item of "A" is the same as the second item of "B", etc. It follows that they can only be the same if they are the same length.
Unordered lists (such as the [attributes] property) are compared pairwise but without respect to order. In other words, two unordered lists "A" and "B" are the same if and only if there exists a set of pairs of items, one from each list, such that the two items in each pair are equal and no item from "A" or "B" appears in more than one pair. It follows that they can only be the same if they are the same length.
XML Base. If the infosets being compared were constructed by an application that claims conformance to the XML Base recommendation, then the xml:base attribute is excluded from attribute comparisons.
In this specification, the base URI is not considered significant.
Natural Language. The xml:lang attribute is not treated specially in the Infoset but is intended to have a scoped effect much like the base URI. This intention is made explicit in this specification.
If the infosets being compared were constructed by an application that provides application semantics for xml:lang, then the application must be able to determine whether or not two elements or attributes have the same language.
If the infosets being compared were constructed by an application that does not provide special semantics for xml:lang, then two elements or attributes have the same language if they have the same inherited value for xml:lang.
The inherited value for xml:lang is the value of xml:lang on the element in question or the value from the closest ancestor. In XPath terms: (ancestor-or-self::*/@xml:lang)[last()]
Languages are compared case insensitively.
XML Space. This specification does not extend any special status to the xml:space attribute, nor does it treat whitespace marked as [element content whitespace] in any special way.
When two information items are compared:
- Properties with the value "no value" are equal.
- Properties with the value "unknown" are not equal.
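
The list and property rules above can be sketched in Python. This is only an illustration, not part of the finding: the names NO_VALUE, UNKNOWN, property_equal, ordered_equal, and unordered_equal are hypothetical, and eq stands for whatever item-level equality the application supplies. The sketch reads the "unknown" rule as "an unknown value never compares equal", and the unordered comparison uses greedy matching, which is sufficient only on the assumption that eq is symmetric and transitive.

```python
from typing import Any, Callable, Sequence

# Hypothetical sentinels for the two special infoset property values.
NO_VALUE = object()
UNKNOWN = object()


def property_equal(a: Any, b: Any) -> bool:
    """Compare two simple property values."""
    if a is UNKNOWN or b is UNKNOWN:
        return False      # assumption: "unknown" never compares equal
    if a is NO_VALUE or b is NO_VALUE:
        return a is b     # "no value" equals only "no value"
    return a == b


def ordered_equal(xs: Sequence, ys: Sequence,
                  eq: Callable[[Any, Any], bool]) -> bool:
    """Pairwise, in-order comparison; the lengths must match."""
    return len(xs) == len(ys) and all(eq(x, y) for x, y in zip(xs, ys))


def unordered_equal(xs: Sequence, ys: Sequence,
                    eq: Callable[[Any, Any], bool]) -> bool:
    """Pairwise comparison ignoring order; each item is paired at most once."""
    if len(xs) != len(ys):
        return False
    remaining = list(ys)
    for x in xs:
        for i, y in enumerate(remaining):
            if eq(x, y):
                del remaining[i]   # y is now paired and cannot be reused
                break
        else:
            return False           # no unpaired partner left for x
    return True
```

Deleting each matched item from remaining is what enforces the condition that no item appears in more than one pair.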

2.1 Infosets

Two infosets are equal if and only if their root information items are equal.

The comparison explicitly ignores the XML version. The XML version has an impact on infoset construction (with respect to line-feed normalization, for example), but it is not necessary to consider it in infoset comparison. Element names, attribute names, and element content are the same if their characters are the same, regardless of how they were encoded.

2.2 Document Information Items

Two document information items are equal if the following properties are equal:

[children]
[document element]
[all declarations processed]

2.3 Element Information Items

Two element information items are equal if they have the same language and the following properties are equal:

[namespace name]
[local name]
[children]
[attributes], exclusive of xml:lang
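
As a rough illustration of the element rule, including the inherited, case-insensitive xml:lang comparison, here is a sketch built on Python's xml.etree.ElementTree. It is an approximation under stated assumptions: ElementTree is not an infoset API, its default parser drops comments and processing instructions, character children are exposed as text and tail strings, and xml:base is not treated specially here. The function name elements_equal and its lang_a and lang_b parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_LANG = f"{{{XML_NS}}}lang"


def _attrs(elem):
    # [attributes], exclusive of xml:lang; dict equality ignores order.
    return {k: v for k, v in elem.attrib.items() if k != XML_LANG}


def elements_equal(a, b, lang_a=None, lang_b=None):
    """Approximate element equality; lang_a/lang_b are inherited languages."""
    # The nearest xml:lang wins: the element's own value, else the inherited one.
    lang_a = a.get(XML_LANG, lang_a)
    lang_b = b.get(XML_LANG, lang_b)
    # Assumption: a missing xml:lang and an empty one both mean "no language".
    same_lang = (lang_a or "").lower() == (lang_b or "").lower()
    return (
        same_lang
        and a.tag == b.tag                    # [namespace name] and [local name]
        and _attrs(a) == _attrs(b)
        and (a.text or "") == (b.text or "")  # characters before the first child
        and len(a) == len(b)
        and all(
            elements_equal(x, y, lang_a, lang_b)
            and (x.tail or "") == (y.tail or "")  # characters after each child
            for x, y in zip(a, b)
        )
    )
```

Passing the inherited language down the recursion stands in for the ancestor-or-self lookup, since ElementTree elements do not keep a reference to their parent.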

2.4 Attribute Information Items

Two attribute information items are equal if they have the same language and the following properties are equal:

[namespace name]
[local name]
[normalized value]
[attribute type]

2.5 Processing Instruction Information Items

Two processing instruction information items are equal if the following properties are equal:

[target]
[content]

2.6 Unexpanded Entity Reference Information Items

Two unexpanded entity reference information items are equal if the following properties are equal:

[name]
[system identifier]
[public identifier]

2.7 Character Information Items

Two character information items are equal if the following properties are equal:

[character code]
[element content whitespace]

2.8 Comment Information Items

Two comment information items are equal if the following properties are equal:

[content]

2.9 Document Type Declaration Information Items

Two document type declaration information items are equal if the following properties are equal:

[system identifier]
[public identifier]
[children]

2.10 Unparsed Entity Information Items

Two unparsed entity information items are equal if the following properties are equal:

[name]
[system identifier]
[public identifier]
[notation name]
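
Taken together, sections 2.2 through 2.10 amount to a table from item type to the properties that take part in the comparison. The sketch below expresses that table in Python. It assumes a purely hypothetical representation in which each information item is a dict with a "kind" key and one key per property, with None standing for "no value". The language check, the exclusion of xml:lang and xml:base, and the "unknown" rule are deliberately left out to keep the example short.

```python
# Properties compared for each kind of information item (sections 2.2-2.10).
COMPARED = {
    "document": ["children", "document element", "all declarations processed"],
    "element": ["namespace name", "local name", "children", "attributes"],
    "attribute": ["namespace name", "local name", "normalized value",
                  "attribute type"],
    "processing instruction": ["target", "content"],
    "unexpanded entity reference": ["name", "system identifier",
                                    "public identifier"],
    "character": ["character code", "element content whitespace"],
    "comment": ["content"],
    "document type declaration": ["system identifier", "public identifier",
                                  "children"],
    "unparsed entity": ["name", "system identifier", "public identifier",
                        "notation name"],
}


def items_equal(a, b):
    """Items of different kinds are never equal; otherwise compare by table."""
    if a["kind"] != b["kind"]:
        return False
    return all(values_equal(a.get(p), b.get(p), p) for p in COMPARED[a["kind"]])


def values_equal(x, y, prop):
    if isinstance(x, dict) and isinstance(y, dict):
        return items_equal(x, y)              # a nested information item
    if isinstance(x, list) and isinstance(y, list):
        if len(x) != len(y):
            return False
        if prop == "attributes":              # unordered: pair items greedily
            rest = list(y)
            for item in x:
                idx = next((i for i, r in enumerate(rest)
                            if items_equal(item, r)), None)
                if idx is None:
                    return False
                del rest[idx]
            return True
        return all(items_equal(i, j) for i, j in zip(x, y))  # ordered
    return x == y                             # simple property value
```

Keeping the property lists in data rather than in code makes it straightforward to see, and to adjust, exactly which properties are significant for each item type.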

3 Customizing the Comparison

The algorithm described in 2 Infoset Equality is very conservative. It could be made more permissive with the addition of a few parameters. For example, parameters could adjust the algorithm to do any or all of the following:

Ignore processing instructions.
Ignore comments.
Ignore the document type declaration.

Even if the algorithm remains conservative, applications can influence the results by choosing how the infoset is constructed. There is no single, normative way to construct an infoset.
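
One way such parameters might look is sketched below in Python, using xml.dom.minidom because it retains comments and processing instructions. The CompareOptions class and the significant_children function are hypothetical names chosen for illustration; the finding itself does not define any parameter syntax.

```python
from dataclasses import dataclass
from xml.dom import Node, minidom


@dataclass
class CompareOptions:
    # Hypothetical parameters; the defaults keep the conservative behaviour.
    ignore_processing_instructions: bool = False
    ignore_comments: bool = False
    ignore_doctype: bool = False


def significant_children(node, options):
    """Return the child nodes that would take part in the comparison."""
    skipped = set()
    if options.ignore_processing_instructions:
        skipped.add(Node.PROCESSING_INSTRUCTION_NODE)
    if options.ignore_comments:
        skipped.add(Node.COMMENT_NODE)
    if options.ignore_doctype:
        skipped.add(Node.DOCUMENT_TYPE_NODE)
    return [child for child in node.childNodes
            if child.nodeType not in skipped]


root = minidom.parseString("<r><!--note--><?pi data?><a/></r>").documentElement
strict = significant_children(root, CompareOptions())
relaxed = significant_children(
    root, CompareOptions(ignore_comments=True,
                         ignore_processing_instructions=True))
```

Note that a DOM groups adjacent characters into text nodes, so a DOM-based comparison would also need to merge text nodes that become adjacent once a comment or processing instruction between them is ignored. The infoset, which models each character as a separate item, has no such issue.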

A References

RFC 2119: S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. IETF. March, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
XML Schema Part 1: Structures: XML Schema Part 1: Structures, Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn, Editors. World Wide Web Consortium, 02 May 2001. This version is http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/. The latest version is available at http://www.w3.org/TR/xmlschema-1/.
XQuery 1.0 and XPath 2.0 Data Model: XQuery 1.0 and XPath 2.0 Data Model, Marton Nagy, Norman Walsh, Mary Fernández, et al., Editors. World Wide Web Consortium, 12 Nov 2003. This version is http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/. The latest version is available at http://www.w3.org/TR/xpath-datamodel/.
XML Information Set (Second Edition): XML Information Set (Second Edition), John Cowan and Richard Tobin, Editors. World Wide Web Consortium, 04 Feb 2004. This version is http://www.w3.org/TR/2004/REC-xml-infoset-20040204. The latest version is available at http://www.w3.org/TR/xml-infoset.
XML Path Language (XPath) Version 1.0: XML Path Language (XPath) Version 1.0, James Clark and Steven DeRose, Editors. World Wide Web Consortium, 16 Nov 1999. This version is http://www.w3.org/TR/1999/REC-xpath-19991116. The latest version is available at http://www.w3.org/TR/xpath.

B Examples (Non-Normative)

This appendix provides a few examples to help to clarify what we mean by “the same information.” Unless otherwise stated, there are no unshown, in-scope namespace bindings in any of these examples.

<element-one/>

attr="value"

These information items are different: they are not the same kind of information item.

<element-one/>

<element-two/>

These elements are different: they have different [local name]s.

<element xmlns="http://example.org/ns-one"/>

<element xmlns="http://example.org/ns-two"/>

These elements are different: they have different [namespace name]s.

<element attr1="value1"/>

<element attr1="value1" attr2="value2"/>

These elements are different: they have different [attributes].

<element attr1="value1"/>

<element attr1="a different value"/>

These elements are different: they have different attribute values.

<element attr1="value1" attr2='value2'/>

<element attr2="value2" attr1="value1"/>

These elements are the same: attribute quoting and order are insignificant in the infoset.

<element xmlns="http://example.org/ns"/>

<x:element xmlns:x="http://example.org/ns"/>

These elements are the same: namespace prefix bindings on element and attribute names are not significant in the infoset.
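
The two preceding examples can be checked with an ordinary parser. The snippet below uses Python's xml.etree.ElementTree, which is not an infoset implementation but, like the infoset, exposes expanded names and does not record attribute quoting or namespace prefixes. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Attribute quoting and order do not survive parsing as distinctions.
a = ET.fromstring("""<element attr1="value1" attr2='value2'/>""")
b = ET.fromstring("""<element attr2="value2" attr1="value1"/>""")
same_attributes = a.tag == b.tag and a.attrib == b.attrib

# Prefixes are resolved away; names are exposed as "{namespace name}local name".
c = ET.fromstring('<element xmlns="http://example.org/ns"/>')
d = ET.fromstring('<x:element xmlns:x="http://example.org/ns"/>')
same_name = c.tag == d.tag

print(same_attributes, same_name, c.tag)
# prints: True True {http://example.org/ns}element
```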

<x:element xmlns:x="http://example.org/ns" attr="x:name"/>

<y:element xmlns:y="http://example.org/ns" attr="y:name"/>

These elements are different: they have different attribute values. The infoset does not have attribute values of type “QName”, so it is not possible to determine if the attribute in this case actually contains a QName or if it just contains different characters. This specification compares the characters.

<element>Montréal</element>

<element>Montr&#233;al</element>

These elements are the same: encoding differences are not significant.
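
A short Python check of the same point, using xml.etree.ElementTree as a stand-in for an infoset builder: a literal character in UTF-8, the same character in ISO-8859-1, and a numeric character reference should all yield the same characters. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same element serialized three ways: two byte encodings and a
# numeric character reference.
utf8 = ET.fromstring("<element>Montr\u00e9al</element>".encode("utf-8"))
latin1 = ET.fromstring(
    "<?xml version='1.0' encoding='ISO-8859-1'?>"
    "<element>Montr\u00e9al</element>".encode("latin-1")
)
reference = ET.fromstring(b"<element>Montr&#233;al</element>")

# All three parse to the same sequence of characters.
texts = {utf8.text, latin1.text, reference.text}
print(len(texts))
# prints: 1
```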

<element xml:lang="us-EN">
  <element>Some content.</element>
</element>

<element xml:lang="us-EN">
  <element xml:lang="us-en">Some content.</element>
</element>

These elements are the same: all of the elements have the same content and are in the same language.

<element xsi:type="xs:double">3.0</element>

<element xsi:type="xs:double">3</element>

These elements are different. The comparison is based on the infoset, not on properties of the PSVI, even if the content might be the same under some other interpretations.

<element>
  <element2>Some content.</element2>
</element>

<element><element2>Some content.</element2></element>

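These elements are different. In the first case, the whitespace before and after element2 contributes character information items to the [children] of the outer element; in the second, element2 is its only child.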

<element>Some content.</element>

<element>Some
content.</element>

These elements are different. New line characters are not normalized in element content.

<element attr="one two">Some content.</element>

<element attr="one
two">Some content.</element>

These elements are the same. New line characters and whitespace are normalized in attribute content.
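
The last three whitespace examples can be reproduced with Python's xml.etree.ElementTree, again only as an approximation of infoset construction. ElementTree reports the characters before the first child element as text and the characters after a child as that child's tail; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Attribute values: a literal line break is normalized to a space.
a = ET.fromstring('<element attr="one two">Some content.</element>')
b = ET.fromstring('<element attr="one\ntwo">Some content.</element>')
attrs_same = a.get("attr") == b.get("attr")

# Element content: the line break survives as a distinct character.
c = ET.fromstring("<element>Some content.</element>")
d = ET.fromstring("<element>Some\ncontent.</element>")
text_same = c.text == d.text

# Indentation adds whitespace characters around the child element.
e = ET.fromstring("<element>\n  <element2>Some content.</element2>\n</element>")
f = ET.fromstring("<element><element2>Some content.</element2></element>")
indent_same = (e.text or "") == (f.text or "")

print(attrs_same, text_same, indent_same)
# prints: True False False
```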
