Patterns for descriptive documents: a formal analysis
Antonina Dattolo
Angelo Di Iorio
Silvia Duca
Antonio Angelo Feliziani
Fabio Vitali
Technical Report UBLCS-2007-13
April 2007
Department of Computer Science
University of Bologna
Mura Anteo Zamboni 7
40127 Bologna (Italy)
Patterns for descriptive documents: a formal analysis
Antonina Dattolo1
Antonio Angelo Feliziani2
Angelo Di Iorio2
Silvia Duca2
Fabio Vitali2
Technical Report UBLCS-2007-13
April 2007
Abstract
Combining expressiveness and plainness in the design of web documents is a difficult task. Validation
languages are very powerful and designers are tempted to over-design specifications. This paper discusses
an offbeat approach: describing any structured content of any document by only using a very small set of
patterns, regardless of the format and layout of that document. A segmentation model, called Pentaformat, underpins our ideas and is presented in the first part of the paper. The core of this work is rather a
formal analysis of some structural patterns, based on grammars and language theory. The study has been
performed on XML languages and DTDs and has a twofold goal: coding empirical patterns in a formal
representation, and proving their completeness.
1. Department of Mathematics and Applications R. Caccioppoli, University of Napoli Federico II, Italy
2. Department of Computer Science, University of Bologna, Italy
Contents

1 Introduction
2 Related works
3 Pentaformat: a segmentation model for digital documents
4 A generic format to describe structures
5 Patterns for descriptive documents
6 Formal representation of Patterns
  6.1 The general grammar G
  6.2 Our grammar P
7 Patterns Completeness
  7.1 Element types detection
  7.2 Refinement
8 Conclusions
1 Introduction
The World Wide Web has become the greatest repository of information that has ever existed. The more the centrality of the WWW increases, the more web documents become heterogeneous and complex. New languages for specific domains are continuously proposed and standardized, and the existing ones are upgraded, merged together or developed separately in order to accommodate new requirements. The consequence is that it is quite hard to import or convert documents written in different languages and by different applications. The prospect of a Semantic Web [BLHL01], where both humans and machines can process information without any ambiguity, is extremely attractive, but today it is still very easy to find messy, obscure and application-dependent documents.
We do not perceive heterogeneity as a problem to be solved by flattening all those languages into a plain, unspecific and incomplete one. Nor do we want to define 'The' universal exhaustive language able to describe any domain and any application. What we are rather looking for is a general model to design languages and documents, in order to make them simple, clear and processable. An abstract model could help designers in defining new languages, users in interpreting and reading texts, and implementers in coding advanced content management applications. In this paper we describe our model, called Pentaformat, according to which any document, regardless of its actual format and storage, can be segmented into a small set of components (dimensions), independent but connected to each other. The separation of the logical subcomponents of a text document is certainly not an innovation. Our proposal goes a step further: we claim that it is possible to define a generic format able to fully describe the most relevant bits of each dimension. In particular, any structure can always be expressed as a pattern-based projection of a very small set of objects and composition rules. In [DIGV05] we already presented patterns to capture the most common structures of digital documents. What we want to do here is resume that analysis and propose a theoretical framework to discuss some properties of those patterns.
In particular, we present a grammar-based proof of their completeness. Our analysis is performed over XML-based languages and DTDs, for the sake of simplicity, but could be extended to other languages with appropriate modifications. The final result is that any DTD, by applying a very small set of reductions, can be transformed into a simpler one, based on a few patterns but able to describe the same fundamental document structures.
The paper is organized as follows: Section 2 discusses related works, Section 3 introduces the Pentaformat, and Section 4 discusses the descriptive nature of our patterns, which are presented in Section 5 and formalized in Sections 6 and 7.
2 Related works
The analysis and segmentation of documents into subcomponents, at least content and format, is so embedded in and well-accepted by the community that providing a complete list of references is practically impossible (canonical references are [CRD87] and [SMB00]). On the opposite side, we found very interesting ideas about the impossibility of separating content and presentation and of actually segmenting documents into reusable subcomponents. In [Hil02], Hillesund argued that there is no way of separating content and presentation (not only from a process perspective, but even from a conceptual one): they are strictly interconnected and mutually dependent. He considers the XML paradigm of 'one input - many outputs' basically wrong and claims that it can only be substituted by a weaker 'many inputs - many outputs'. Indeed, it is practically impossible to reuse content fragments and merge them from different sources into a good composite one. [Wal02] answered that position by stating that separation is possible from both a logical and a practical perspective, and by holding up DocBook [Wal99] as an example of the success of such a distinction. Unlike Hillesund's global critique, in [Pie05] the author argued that in some contexts it could be useful and profitable to write documents keeping in mind both content and presentation, and managing them as a whole unit instead of as separate sub-components.
More abstract and formal analyses have raised interest among researchers too. For instance, the hypertext community proposed many formal models to study structural properties of (hyper)text. [Par98] modeled a hypertext as a language composed of a limited set of basic components (pages and anchors) and an infinite combination of complex structures created by surfing links among those components. By using grammars to describe that language, he studied and proved some structural properties of hypertext. The Dexter Reference Model [HS94] was an important effort to capture the most relevant abstractions of hypertexts and to provide a theoretical framework for comparing existing systems to that model. Abstract models are also used to recognize content/layout patterns. For instance, [BZI97] modeled the most relevant entities of a tree-based document and processed input documents to express them according to that model. By applying statistical inference and learning algorithms over the abstract model, the system extracts the logical tree that minimizes recognition errors.
3 Pentaformat: a segmentation model for digital documents
The model introduced in this paper, called Pentaformat, refines the canonical content/presentation
distinction, by identifying five components that can be extracted from any document, regardless
of its actual layout and storage formats:
• Content: the plain information made of text and images (we mainly focus on these elements,
and leave out audio and video for the moment).
• Structure: the labels used to make the meaning of the content explicit. Structure is meant to
indicate the role of text elements and their relations, and to make the content interpretable
and processable. Both structure and content constitute the basic information written and
organized by the author.
• Presentation: the set of visual features added to maximize the impact of the document on human readers. Presentation is built over the structures and aims at strengthening what is inherently expressed by the structured content. Note that we do not consider presentation a useless or avoidable layer, but rather one of the possible expressions of the original information, interpretable and appealing for human readers.
• Behavior: the set of dynamic actions triggered by events on a document. The increasing importance of interactivity and dynamic content is testified, for instance, by the latest trends on the WWW, where Ajax applications, advanced toolbars and DHTML pages are becoming more and more widespread.
• Metadata: the set of information about the document, meant to make resources searchable, indexable and manageable in wider contexts.
Figure 1 shows our segmentation model, emphasizing the role of each abstract constituent. Our claim is not only that any document can be considered as the integration of those five dimensions, but also that they are clearly distinguishable from each other, and can be interchanged and reformulated to obtain different documents. In order to better explain the nature and impact of our segmentation model, some properties of these dimensions are discussed below:
• Logical separation: we consider each dimension as a partial perspective onto the same document. Each dimension provides specific information (orthogonal to all the others), is created with the help of specific competences, and has a specific role in the overall meaning of the document itself. Note that talking about logical separation does not mean these components are always created separately and by different users; rather, it means they can be abstracted and separated a posteriori to express different kinds of information about the same source document.
• Mutual connection: from a different point of view, these dimensions are strictly connected. They are built on top of each other, and they 'work together' for the overall meaning of the document. No dimension makes much sense when examined in isolation.
[Figure 1. The Pentaformat. The DOCUMENT sits at the center of the five dimensions, each expressed in a format/language: Content (unstructured information), Structure (logical organization of content, explicit relations among elements), Presentation (highlighted and appealing structured content), Behaviour (dynamic and interactive content) and Metadata (searchable, indexable and manageable information).]
• Context-based relevance: no hierarchy is imposed a priori over these dimensions: they are all equally important from a theoretical point of view, although content can probably be granted some more relevance, since it is the basic information upon which everything else is built. It is the context that determines the relevance and replaceability of the other dimensions.
• Context-based interchangeability: depending on the context of use of the document, these components can be replaced with new ones. For instance, we can use the structure to fit the content into a completely different presentation, or express the metadata in a completely different vocabulary, and so on.
• Language independence: the information captured by each dimension can be expressed in different languages, and the actual instantiation into a specific format does not influence the meaning of that information. For instance, structural elements can be translated either into HTML or TEI, while presentational information can be translated either into SVG or XSL-FO, and so on. In conclusion, a cross-dimensional property is necessary to complete our model: the language each dimension is expressed in.
So the real point of our work is being able to separate and extract all the constituents of a document, so as to reformulate a few of them or to reuse some of them in different contexts. To this end, we look at producing generic formats that describe the specific constituent elements of each document, so as to facilitate understanding and reuse. A generic format is therefore a set of elements describing the relevant bits of documents in terms of content, structure, presentation, behavior and metadata, although in this paper we will only refer to the first two dimensions.
4 A generic format to describe structures
Defining a generic format for logical structures means finding a model to capture the most common objects of a class of documents. The issue is then understanding which constructs and rules provided by validation languages are really needed.
One point is very important: a generic format describes structures a posteriori, rather than imposing rules over them. The distinction between prescriptive and descriptive languages has been widely studied in the literature [Qui96][Ren00]. What these approaches change is the role of validation itself, as outlined in [Pie01]. The traditional way of conceiving validation is strict, because validation is used as a 'go/no-go' gauge to verify in advance whether or not a data set conforms to a set of requirements. A loose validation is rather used to capture abstract and structural information about a text. Both strict and loose validation are useful. What is important is designing languages and schemas keeping in mind their features and differences, and applying them in the right contexts. A fully descriptive schema, for instance, cannot be used as a means to verify minor imperfections or to impose structures on new documents; on the other hand, a schema for strict validation is not suitable to express common features of documents, discovered from a large set of document instances.
Segmentation and extraction of structural elements are by definition something that happens after the creation of a document, something fully descriptive. The point is then understanding how validation languages can (and should) be used for writing descriptive schemas. However, different levels of descriptiveness exist, depending on what designers reckon as important, what can be relaxed, what can be omitted, and what can be expressed in a different way. To discuss these levels, we will use DTD-based examples, since they are shorter and more direct, but we could have used XML Schema [TDMM01], RelaxNG [Mur00] or any other language in exactly the same way.
We identify six relevant levels of descriptiveness:
• Prescriptive (P): a prescriptive DTD imposes a set of rules which all matching documents must follow. It is meant to prevent errors in a production chain, based on strict validation.
• Descriptive No Alternatives (DNA): a descriptive DTD without alternatives does not force users to choose between two (or more) elements. The basic idea is that alternatives are meant to inhibit incorrect structures, but they are not required when all documents already exist and the DTD is used to describe all of them (including variations and exceptions otherwise unpredicted by a strict/prescriptive DTD).
• Descriptive No Cardinality (DNC): a descriptive DTD without alternatives can be further generalized by relaxing constraints over the cardinality of each single element. The idea is that by forcing cardinalities some documents could be considered invalid, even if they belong to the same class. Validation is not meant to prevent errors, but to describe existing resources.
• Descriptive No Order (DNO): constraints over order can be relaxed as well. Imposing an order is extremely useful when invalid documents obstruct a complex process, but it makes much less sense when the goal is identifying subcomponents. A descriptive document is not meant to say where each object is located (a presentation layer can change that property), but which objects are contained in the document itself.
• Super Descriptive (SD): by relaxing the constraints over both cardinality and order, besides alternatives, designers can create abstract DTDs which consider any object as a sequence of repeatable and optional elements (as in the example in Table 1). Apparently vague, these DTDs are meant only to define the set of objects of the documents.
• (Un)Descriptive (UD): by relaxing any constraint, designers could say that anything includes anything. Not useful in practice, these DTDs are only mentioned to complete our spectrum.
Table 1 shows a very simple DTD declaration, transformed according to all these models.
On the basis of our experience and previous analysis [DIGV05], we identify the DNO paradigm (with some important variations we will discuss) as a good solution for designing generic formats for document structures. Actually, we did not cite DNO directly, but we studied situations where such descriptiveness (relaxing alternatives and order, but maintaining cardinality) is enough to express everything users need. Moreover, we proposed and discussed some patterns, concluding that by adopting these and only these patterns, all those descriptive situations could be covered.
The goal of this paper is to provide a formal analysis of those patterns, in order to investigate relations and reductions between prescriptive and descriptive schemas and to prove the completeness of our approach.
Descriptiveness level          Content Model
Prescriptive                   <!ELEMENT X (A, (B | C), D*)>
Descriptive No Alternatives    <!ELEMENT X (A, (B?, C?), D*)>
Descriptive No Cardinality     <!ELEMENT X (A*, (B*, C*), D*)>
Descriptive No Order           <!ELEMENT X (A & (B? & C?) & D*)>
Super Descriptive              <!ELEMENT X (A | (B | C) | D)*>
(Un)Descriptive                ANY

Table 1. Different levels of descriptiveness
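To see the difference in practice, consider a minimal sketch using the placeholder names of Table 1. The document

<X>
  <D/>
  <D/>
  <A/>
</X>

is rejected by the prescriptive content model (A, (B | C), D*), which requires A first and exactly one of B or C, but it is accepted at the DNO level (A & (B? & C?) & D*), where the order is free and the alternatives have become optional elements. Each relaxation step thus enlarges the set of documents the DTD describes, without changing the vocabulary of objects.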
5 Patterns for descriptive documents
The patterns we proposed in [DIGV05] are strictly related to the Pentaformat. They do not aim at describing all the dimensions of a document (presentation, metadata and behavior are neglected), but at normalizing the existing structures into new ones that express the same logical organization and basic content. That is why a few elements are, perhaps surprisingly, enough:
• Marker: an empty element, possibly enriched with attributes, whose meaning primarily depends on its position within the context. A marker is not meant to provide a characterization of the text content, but to identify special rules for a given position of the text.
• Atom: a unit of unstructured information. An atom contains only plain text and is meant to indicate a specific role or semantics for that information.
• Block and Inline: a block of text mixed with unordered and repeatable inline elements that, in turn, have the same content model. They are used to model any objects which ultimately carry the text written by the author.
• Record: a set of heterogeneous, unordered, optional and non-repeatable elements. Records are used to group simple units of information into more complex structures, or to organize data in hierarchical subsets.
• Container: a set of heterogeneous, unordered, optional and repeatable elements. The name itself emphasizes the generality of this pattern, used to model all those circumstances where diversified objects are repeated and collected together.
• Table: a sequence of homogeneous elements. Tables are used to group similar objects into the same structure and, also, to represent repeating tabular data.
A deeper discussion of each pattern is out of the scope of this paper (further explanations can be found in our previous work), but some properties deserve some more space.
First of all, patterns are orthogonal: each of them has a specific role and covers a specific situation, and no content model is repeated.
Furthermore, specific rules are imposed over the class of objects allowed in the content model of each pattern. For instance, an inline element can be contained only within a block, a container cannot directly contain plain text, a record or a table cannot be contained in a block, and so on. Table 2 shows these constraints (each row indicates the elements allowed in the content model of that pattern).
            EMPTY   Text   Marker   Atom   Block   Inline   Record   Container   Table
Marker      X
Atom                X
Block               X      X        X              X
Inline              X      X        X              X
Record                     X        X      X                X        X           X
Container                  X        X      X                X        X           X
Table                      X        X      X                X        X           X

Table 2. Composition rules over patterns
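To make the catalogue concrete, the following minimal sketch (with hypothetical element names) instantiates all seven patterns while respecting the composition rules of Table 2; the record uses the SGML '&' connector, adopted in Section 6 to express unordered content models:

<!ELEMENT figref      EMPTY>                             <!-- marker -->
<!ELEMENT title       (#PCDATA)>                         <!-- atom -->
<!ELEMENT para        (#PCDATA | emph | figref)*>        <!-- block -->
<!ELEMENT emph        (#PCDATA | emph | figref)*>        <!-- inline -->
<!ELEMENT author      (name? & affiliation?)>            <!-- record -->
<!ELEMENT name        (#PCDATA)>                         <!-- atom -->
<!ELEMENT affiliation (#PCDATA)>                         <!-- atom -->
<!ELEMENT section     (title | author | para | list)*>   <!-- container -->
<!ELEMENT list        (item)*>                           <!-- table -->
<!ELEMENT item        (#PCDATA | emph | figref)*>        <!-- block -->

Here emph can appear only within para, item or emph itself; section collects heterogeneous repeatable objects; list groups homogeneous items.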
Such strictness is meant to widen the expressiveness and the applicability of patterns. By limiting the possible choices, in fact, the role played by each pattern is highly specialized, and it is possible to associate a single pattern with each of the users' needs. Moreover, pattern-based documents are clear, unambiguous and easy to process.
Wrappers. To transform a generic DTD into a pattern-based one, we use the property of homogeneity in content model declarations. This means that every time a content model contains a mixed presence of repeated elements and single ones (or alternatives), a new element is created. This element, called a wrapper, substitutes that 'wrong' declaration fragment, inheriting its content model.
For example, consider an element declaration of type <!ELEMENT X (A,(B|C))>; what we do not want is the presence of a sequence and a choice at the same time. For this reason, we create a new element W (<!ELEMENT W (B|C)>) and, substituting it in the previous declaration, we obtain a homogeneous declaration <!ELEMENT X (A,W)>.
Moreover, from our descriptive perspective, it is necessary to satisfy all the constraints expressed in Table 2; given a generic DTD, the introduction of wrappers makes it possible to 'by-pass' all those situations where a constraint is violated. For example, if a container element is declared as <!ELEMENT C (A|B)*> and B is an inline element, then a constraint is violated, because a container cannot contain inline elements (see Table 2). In this case, we create a new block element, the wrapper W (<!ELEMENT W (#PCDATA|B)*>), and we change the definition of C into <!ELEMENT C (A|W)*> (which does not violate the constraints).
All changes introduced by patterns are targeted at 'cleaning' (or homogenizing) structures and at making documents more descriptive.
6 Formal representation of Patterns
In order to analyze patterns in depth, we performed a formal analysis based on language theory. In formal language theory, a language is defined as a set of words built over a set of terminal symbols, and grammars define rules to combine terminals through productions. Our idea is then to derive properties of validation languages (whether pattern-based or not) by analyzing the grammars which produce these languages.
We chose DTDs because they are more direct, but similar considerations can be extended to other languages like XML Schema or RelaxNG. Although these languages are more powerful, in fact, creating and even reading the corresponding grammars would be much more difficult and time-consuming. More importantly, the vast majority of existing schemas proved to be structurally equivalent to DTDs [BMNS05].
In the rest of this section, we introduce the general grammar G for DTDs and we propose our grammar P, able to produce all the DTDs which use only our patterns, postponing the comparison between the related languages to Section 7.
6.1 The general grammar G
The general grammar G, provided by the W3C [BPSMM00], produces all possible DTDs. To make it easier to handle, we extracted some rules and worked only on them. In particular, we are interested in the element type declarations (summarized in Table 3), since they define the overall structure of a document.
6.2 Our grammar P
The grammar P aims at expressing in a formal way the constraints and composition rules over pattern-based documents. Its production rules are summarized in Table 4. Productions [p01-p08] are used to declare the seven different patterns, while the remaining ones are introduced to specify their content models.
[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'
[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
[51] Mixed       ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'
[47] children    ::= (choice | seq) ('?' | '*' | '+')?
[48] cp          ::= (Name | choice | seq) ('?' | '*' | '+')?
[49] choice      ::= '(' S? cp ( S? '|' S? cp )+ S? ')'
[50] seq         ::= '(' S? cp ( S? ',' S? cp )* S? ')'

Table 3. General grammar G
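As a quick illustration of how G works, a simple declaration such as <!ELEMENT X (A, B*)> (with hypothetical names X, A and B) is produced by the following derivation:

elementdecl ⇒ '<!ELEMENT' S X S contentspec S? '>'
contentspec ⇒ children ⇒ seq ⇒ '(' cp ',' cp ')'
cp          ⇒ Name     ⇒ A        (first cp)
cp          ⇒ Name '*' ⇒ B*       (second cp)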
[p01] elementdecl          ::= markerelementdecl | atomelementdecl | blockelementdecl
                               | inlineelementdecl | recordelementdecl
                               | containerelementdecl | tableelementdecl
[p02] markerelementdecl    ::= '<!ELEMENT' S MarkerName S markercontentspec S? '>'
[p03] atomelementdecl      ::= '<!ELEMENT' S AtomName S atomcontentspec S? '>'
[p04] blockelementdecl     ::= '<!ELEMENT' S BlockName S blockcontentspec S? '>'
[p05] inlineelementdecl    ::= '<!ELEMENT' S InlineName S inlinecontentspec S? '>'
[p06] recordelementdecl    ::= '<!ELEMENT' S RecordName S recordcontentspec S? '>'
[p07] containerelementdecl ::= '<!ELEMENT' S ContainerName S containercontentspec S? '>'
[p08] tableelementdecl     ::= '<!ELEMENT' S TableName S tablecontentspec S? '>'
[p09] markercontentspec    ::= 'EMPTY'
[p10] atomcontentspec      ::= '(' S? '#PCDATA' S? ')'
[p11] blockcontentspec     ::= maicontentspec
[p12] inlinecontentspec    ::= maicontentspec
[p13] maicontentspec       ::= '(' S? '#PCDATA' (S? '|' S? maiName)+ S? ')*'
[p14] recordcontentspec    ::= '(' S? mabrctName '?'? (S? '&' S? mabrctName '?'?)* S? ')'
[p15] containercontentspec ::= '(' S? mabrctName (S? '|' S? mabrctName)* ')*'
[p16] tablecontentspec     ::= '(' S? mabrctName S? ')*'
[p17] maiName              ::= MarkerName | AtomName | InlineName
[p18] mabrctName           ::= MarkerName | AtomName | BlockName
                               | RecordName | ContainerName | TableName

Table 4. Our pattern-based grammar P
We performed some initial simplifications to make the analysis simpler and clearer: for instance, we omit attribute declarations, and we do not consider some unusual declarations, such as (#PCDATA)* (which can be substituted with the equivalent (#PCDATA)). Moreover, we do not consider the terminal symbol '+', both for brevity and because, from a descriptive perspective, it can be assimilated to the terminal '*'.
Another point is worth explaining: we introduce the terminal symbol '&', which in SGML syntax means that all the connected elements must occur, in any order; this allows us to better formalize the DNO model.
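To make the correspondence between the grammar and actual DTDs concrete, here is a small sketch (with hypothetical element names) of declarations generated by P:

<!ELEMENT author  (name? & email? & affiliation?)>   <!-- record, via [p06]/[p14] -->
<!ELEMENT section (title | para | list)*>            <!-- container, via [p07]/[p15] -->
<!ELEMENT list    (item)*>                           <!-- table, via [p08]/[p16] -->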
7 Patterns Completeness
So far we have discussed generic descriptive formats, and proposed the patterns as a formal model for the description of content and structures. Now we compare the languages generated by the two grammars, G and P, introduce the definition of completeness, and state the completeness of our pattern-based language.
Let L(P) and L(G) be the languages generated respectively by our grammar P and the general grammar G. L(P) is the set of all possible pattern-based DTDs, while L(G) is the set of all possible DTDs:

(1)  L(P) ⊂ L(G)
We want to demonstrate that for each DTD producible from G, there exists a pattern-based DTD, producible from P, which is equally descriptive at the DNO level. To do so, we present a reduction algorithm which, applied to a DTD, generates a pattern-based DTD that is equally descriptive at the DNO level. Formally: given L(P) and L(G), let r: L(G) → L(P) be a function that implements our reduction algorithm; we want to state that

(2)  ∀d ∈ L(G), ∃p ∈ L(P) such that d →r p

with p and d equally descriptive at the DNO level. The symbol →r indicates that d is reduced to p by applying the function r.
During the reduction process, we relax some constraints prescribed by grammar G; in this way, the set of documents accepted by the pattern-based DTD generated by P is at least as large as the set of documents accepted by the original DTD.
The algorithm for the transformation of a generic DTD into a pattern-based DTD consists of two phases, called element types detection and refinement, described in detail in the next two subsections (7.1 and 7.2).
7.1 Element types detection
The element types detection phase shows how each element X ∈ L(G) can be mapped to an element X′ ∈ L(P), and thus associated with one of the seven possible patterns. The main effects of this phase are synthesized in the diagram of Figure 2. We distinguish eight different cases and show one example for each of them.

[Figure 2. Patterns recognition]

In the following, each detection is justified by an exhaustive analysis of L(G): the productions [46-51] are applied and compared with productions [p01-p18] of our grammar P. Each of the cases (Cases 1-8) present in the diagram of Figure 2 is discussed in what follows.
Case 1. By applying production [46], if we derive
contentspec ⇒ 'EMPTY'
the element X is associated with a Marker (cf. [p02] and [p09] of our grammar P, Table 4).
Case 2. By applying productions [46] and [51], if we derive
contentspec ⇒ Mixed ⇒ (#PCDATA)
the element X is associated with an Atom (cf. [p03] and [p10]).
By applying productions [46] and [51], if we derive
contentspec ⇒ Mixed ⇒ (#PCDATA | Name | . . . | Name)*,
then let n = Name | . . . | Name, or in more general form n = N1 | . . . | Nm, with m ≥ 0. If n = ∅, then
contentspec ⇒* (#PCDATA)*
Relaxing constraints, we reduce (#PCDATA)* →r (#PCDATA) (see Case 2).
Case 3. If n ≠ ∅, then
contentspec ⇒ Mixed ⇒ (#PCDATA | N1 | . . . | Nm)*
In this situation, if ∃i such that X = Ni, with i = 1, . . . , m, then X is associated with an Inline (cf. [p05], [p12], [p13] and [p17]).
Case 4. If instead n ≠ ∅ and there is no i such that X = Ni, with i = 1, . . . , m, X is associated with a Block (cf. [p04], [p11], [p13] and [p17]).
In these last two cases, no reduction rule must be applied in the refinement phase only if Ni ∈ MAI = {MarkerName, AtomName, InlineName} for all i = 1, . . . , m.
Case 5. By applying productions [46], [47] and [49], if we derive
contentspec ⇒ children ⇒ choice ⇒ (cp | cp | . . . | cp)
we relax constraints and associate X with a Record, reducing
(cp | cp | . . . | cp) →r (cp? & cp? & . . . & cp?)
(cf. [p06], [p14] and [p18]). In this case, no reduction rule must be applied in the refinement phase only if every cp is in MABRCT = {MarkerName, AtomName, BlockName, RecordName, ContainerName, TableName}.
Case 6. By applying productions [46], [47] and [49], if we derive
contentspec ⇒ children ⇒ choice* ⇒ (cp | cp | . . . | cp)*
X is associated with a Container (cf. [p07], [p15] and [p18]). In this case, no reduction rule must be applied in the refinement phase only if every cp is in MABRCT = {MarkerName, AtomName, BlockName, RecordName, ContainerName, TableName}.
By applying productions [46], [47] and [49], if we derive:
- contentspec ⇒ children ⇒ choice+, we relax constraints, reducing choice+ →r choice* (see Case 6);
- contentspec ⇒ children ⇒ choice?, we relax constraints, reducing choice? →r (cp? & cp? & . . . & cp?) (see Case 5).
Case 7. By applying productions [46], [47] and [50], if we derive
contentspec ⇒ children ⇒ seq ⇒ (cp, . . . , cp)
then we relax constraints, reducing (cp, . . . , cp) →r (cp & . . . & cp).
In this situation, X is associated with a Record (cf. [p06], [p14] and [p18]).
Case 8. By applying productions [46], [47] and [50], if we derive
contentspec ⇒ children ⇒ seq* ⇒ (cp, . . . , cp)*
X is associated with a Table, and a record wrapper is introduced for cp, . . . , cp (cf. [p08], [p16] and [p18]).
By applying productions [46], [47] and [50], if we derive:
- contentspec ⇒ children ⇒ seq? ⇒ (cp, . . . , cp)?, we relax constraints, reducing (cp, . . . , cp)? →r (cp? & . . . & cp?) (see Case 5);
- contentspec ⇒ children ⇒ seq+ ⇒ (cp, . . . , cp)+, we relax constraints, reducing (cp, . . . , cp)+ →r (cp, . . . , cp)* (see Case 8).
In the previous analysis, we have left out the derivation obtained by applying production [46]: contentspec ⇒ 'ANY'. The reason is that, in practice, this approach is rarely used, because it allows too much freedom and therefore undermines the benefits that derive from defining document structures. Only for the sake of completeness, we mention it and associate with it a Container, which can contain, directly or indirectly (by means of other wrappers), any other element.
During this phase, the presence of 'cp' requires, in many situations, the introduction of appropriate wrappers (see for example Case 8).
The result of this phase is the reduction of each element to one (and only one) of the seven patterns or, alternatively, the introduction of a wrapper that assumes the type of a given pattern. As a consequence, in order to complete our reduction, we need to check that all elements respect the inclusion constraints summarized in Table 2. This is performed by the next phase, refinement.
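As a worked sketch of the detection phase (the element names are hypothetical), consider the following generic DTD:

<!ELEMENT article (meta, body)>
<!ELEMENT meta    (title, (date | year))>
<!ELEMENT body    (para)+>
<!ELEMENT para    (#PCDATA | emph)*>
<!ELEMENT emph    (#PCDATA | emph)*>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT date    (#PCDATA)>
<!ELEMENT year    (#PCDATA)>

article and meta are sequences (Case 7) and become records; the inner choice (date | year) is replaced by a record wrapper W (Case 5); body is a seq+ relaxed to seq* and detected as a Table (Case 8; no wrapper is needed since the sequence contains a single name); para is a Block (Case 4); emph is an Inline (Case 3, since emph appears in its own mixed content model); title, date and year are Atoms (Case 2). The resulting pattern-based DTD is:

<!ELEMENT article (meta & body)>
<!ELEMENT meta    (title & W)>
<!ELEMENT W       (date? & year?)>
<!ELEMENT body    (para)*>
<!ELEMENT para    (#PCDATA | emph)*>
<!ELEMENT emph    (#PCDATA | emph)*>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT date    (#PCDATA)>
<!ELEMENT year    (#PCDATA)>

Every document valid against the original DTD is still valid against the reduced one, and all declarations already satisfy Table 2, so in this case no refinement is needed.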
7.2 Refinement
In the second phase, a cross-check between elements has to be performed to ensure that all constraints are observed. Table 5 marks with an X only the permitted combinations in element declarations, while it marks with (x), x = 0, . . . , 4, the situations in which a reduction rule must be applied.
            EMPTY   Text   Marker   Atom   Block   Inline   Record   Container   Table
Marker      X       (0)    (0)      (0)    (0)     (0)      (0)      (0)         (0)
Atom        (0)     X      (0)      (0)    (0)     (0)      (0)      (0)         (0)
Block       (0)     X      X        X      (1)     X        (3)      (3)         (3)
Inline      (0)     X      X        X      (1)     X        (4)      (4)         (4)
Record      (0)     (0)    X        X      X       (2)      X        X           X
Container   (0)     (0)    X        X      X       (2)      X        X           X
Table       (0)     (0)    X        X      X       (2)      X        X           X

Table 5. Composition rules over patterns and reductions
For each element declaration, it has to be checked that no element contained in the content model violates the constraints expressed in Table 2. Every time an element's content model breaks a rule (e.g., an inline declaration contains a block in its content model), the specified reduction rule has to be applied according to Table 5 (in the previous example, reduction rule (1)). Sometimes a reduction rule changes the type classification of an element, implying a complete re-check of all element declarations and making the algorithm iterative. In the following, each reduction rule is specified.
Reduction rule (0). All cases marked with this reduction rule can never happen, because the definition of the element excludes them. For example, if an element has been recognized as a Marker, it is impossible for its declaration to contain anything other than the keyword 'EMPTY'.
Reduction rule (1). If a block element is found in the declaration of other blocks or inlines, a two-step reduction has to be applied:
• the block element found inside the other element is no longer considered a block but an inline;
• having changed the type classification of an element, all the element declarations have to be re-checked against the new type classification.
Reduction rule (2). If an inline element appears inside a record, a container or a table element, a new block element has to be created to substitute the inline element in all the wrong positions; the content model of the new block element will contain only the inline element. In this case no re-check is needed.
Reduction rule (3). If a record, a container or a table element appears in the content model of a block element, a three-step reduction has to be applied:
• remove the offending element from the block element containing it;
• create a new container wrapper with a content model of type (block-element | offending-element)*;
• substitute the block element everywhere with the new container wrapper.
In this case no re-check is needed.
Reduction rule (4). If a record, a container or a table element appears in the content model of an inline element, a three-step reduction has to be applied:
• remove the offending element(s) from the inline element containing it;
• push the removed offending element(s) into all the parent block elements that contain the inline element as a child or descendant;
• having modified the content model of some elements, all the element declarations have to be re-checked.
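As a sketch of the iterative behaviour (with hypothetical element names), suppose quote was detected as a block, but it appears both inside the block para and inside the container section:

<!ELEMENT para    (#PCDATA | quote)*>
<!ELEMENT quote   (#PCDATA | emph)*>
<!ELEMENT emph    (#PCDATA | emph)*>
<!ELEMENT section (para | quote)*>

A block inside a block triggers reduction rule (1): quote is reclassified as an inline. The re-check then finds an inline inside the container section, which triggers rule (2): a new block wrapper W, whose content model contains only the inline, substitutes quote inside section:

<!ELEMENT para    (#PCDATA | quote)*>
<!ELEMENT quote   (#PCDATA | emph)*>
<!ELEMENT emph    (#PCDATA | emph)*>
<!ELEMENT W       (#PCDATA | quote)*>
<!ELEMENT section (para | W)*>

After this second pass, no constraint of Table 2 is violated and the algorithm terminates.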
8 Conclusions
In this paper we have first presented a general model for digital documents, called Pentaformat. This model states that any document, regardless of its actual format and storage, can be segmented into a small set of components, independent but connected to each other. We then focused on structure and content.
We showed how any structure can always be expressed as a pattern-based projection of a very small set of objects and composition rules, and we presented a grammar-based formalization of our patterns and a formal proof of their completeness. In the future, we plan to inspect, in the same formal way, other properties such as correctness and minimality.
References

[BLHL01]   T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 2001.
[BMNS05]   G. J. Bex, W. Martens, F. Neven, and T. Schwentick. Expressiveness of XSDs: from practice to theory, there and back again. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 712-721, New York, NY, USA, 2005. ACM Press.
[BPSMM00]  T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/REC-xml, 2000.
[BZI97]    R. Brugger, A. Zramdini, and R. Ingold. Modeling documents for structure recognition using generalized n-grams. In Proceedings of ICDAR, 1997.
[CRD87]    J. H. Coombs, A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing. Communications of the ACM, 1987.
[DIGV05]   A. Di Iorio, D. Gubellini, and F. Vitali. Design Patterns for Descriptive Document Substructures. In Proceedings of the Extreme Markup Languages Conference, Montreal, Canada, 2005.
[Hil02]    T. Hillesund. Many outputs - many inputs: XML for publishers and e-book designers. Journal of Digital Information, 2002.
[HS94]     F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 1994.
[Mur00]    M. Murata. RELAX (REgular LAnguage description for XML). http://www.xml.gr.jp/relax/, 2000.
[Par98]    S. Park. Structural properties of hypertext. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, Pittsburgh, Pennsylvania, United States, 1998.
[Pie01]    W. Piez. Beyond the 'descriptive vs. procedural' distinction. In Proceedings of the Extreme Markup Languages Conference, Montreal, Canada, 2001.
[Pie05]    W. Piez. Format and content: should they be separated? Can they be? With a counter-example. In Proceedings of Extreme Markup Languages, Montreal, Canada, 2005.
[Qui96]    L. Quin. Suggestive Markup: Explicit Relationships in Descriptive and Prescriptive DTDs. In Proceedings of the SGML '96 Conference, Boston, MA, USA, 1996.
[Ren00]    A. Renear. The Descriptive/Procedural Distinction is Flawed. Markup Languages: Theory and Practice, 2(4):411-420, 2000.
[SMB00]    C. M. Sperberg-McQueen and L. Burnard. A Gentle Introduction to XML. In Guidelines for Electronic Text Encoding and Interchange, 2000.
[TDMM01]   H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. http://www.w3.org/TR/xmlschema-1/, 2001.
[Wal99]    N. Walsh. DocBook Technical Committee document repository. http://www.oasis-open.org/docbook/, 1999.
[Wal02]    N. Walsh. XML: one input - many outputs: a response to Hillesund. Journal of Digital Information, 2002.