db5
DISTRIBUTED DATABASES
A distributed database is a database in which not all storage devices are attached to a common processor. It
may be stored in multiple computers, located in the same physical location; or may be dispersed over a network of
interconnected computers.
• A distributed database is a system in which storage devices are not connected to a common processing unit.
• The database is controlled by a distributed database management system, and data may be stored at the same
location or spread over the interconnected network. It is a loosely coupled system.
• A shared-nothing architecture is used in distributed databases.
Data Replication
If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme case, we have
full replication, in which a copy is stored in every site in the system.
There are a number of advantages and disadvantages to replication.
Availability. If one of the sites containing relation r fails, then the relation r can be found in another site. Thus, the
system can continue to process queries involving r, despite the failure of one site.
Increased parallelism. In the case where the majority of accesses to the relation r result in only the reading of the
relation, then several sites can process queries involving r in parallel. The more replicas of r there are, the greater
the chance that the needed data will be found in the site where the transaction is executing. Hence, data replication
minimizes movement of data between sites.
Increased overhead on update. The system must ensure that all replicas of a relation r are consistent; otherwise,
erroneous computations may result. Thus, whenever r is updated, the update must be propagated to all sites
containing replicas. The result is increased overhead. For example, in a banking system, where account information
is replicated in various sites, it is necessary to ensure that the balance in a particular account agrees in all sites.
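The trade-off described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class ReplicatedRelation, the site names S1 to S3, and the account number are all made up for illustration, and real systems use far more elaborate protocols to propagate updates.

```python
class ReplicatedRelation:
    """Toy model: one relation r copied to several sites (names are hypothetical)."""

    def __init__(self, sites):
        # Every site starts with its own empty copy of r.
        self.replicas = {site: {} for site in sites}
        self.failed = set()
        self.update_messages = 0

    def read(self, key, preferred_site):
        # Read locally when possible, otherwise fall back to any live replica.
        order = [preferred_site] + [s for s in self.replicas if s != preferred_site]
        for site in order:
            if site not in self.failed:
                return site, self.replicas[site].get(key)
        raise RuntimeError("no replica of r is available")

    def write(self, key, value):
        # An update must reach every replica so that all copies stay consistent.
        for site in self.replicas:
            self.replicas[site][key] = value
            self.update_messages += 1


r = ReplicatedRelation(["S1", "S2", "S3"])
r.write("A-101", 500)          # one logical update, three physical writes
r.failed.add("S1")             # site S1 goes down
site, balance = r.read("A-101", preferred_site="S1")
print(site, balance, r.update_messages)
```

The read still succeeds after S1 fails (availability), while the single update cost one write per replica (overhead on update).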
Data Fragmentation
If relation r is fragmented, r is divided into a number of fragments r1, r2,...,rn. These fragments contain
sufficient information to allow reconstruction of the original relation r.
There are two different schemes for fragmenting a relation: horizontal fragmentation and vertical
fragmentation.
• Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments.
• Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
In horizontal fragmentation, a relation r is partitioned into a number of subsets, r1, r2,...,rn. Each tuple of
relation r must belong to at least one of the fragments, so that the original relation can be reconstructed, if needed.
account1 = σ_{branch_name = "Hillside"} (account)
account2 = σ_{branch_name = "Valleyview"} (account)
Horizontal fragmentation is usually used to keep tuples at the sites where they are used the most, to minimize data
transfer.
In general, a horizontal fragment can be defined as a selection on the global relation r. That is, we use a
predicate Pi to construct fragment ri:
ri = σ_{Pi} (r)
We reconstruct the relation r by taking the union of all fragments; that is:
r = r1 ∪ r2 ∪ ··· ∪ rn
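As a minimal sketch of horizontal fragmentation, the Python below models a relation as a list of dictionaries, builds one fragment per branch with a selection, and reconstructs the relation with a union. The sample tuples and the helper names select and union are assumptions made for illustration.

```python
# A tiny account relation; the tuples are made-up sample data.
account = [
    {"account_number": "A-305", "branch_name": "Hillside", "balance": 500},
    {"account_number": "A-226", "branch_name": "Hillside", "balance": 336},
    {"account_number": "A-177", "branch_name": "Valleyview", "balance": 205},
]


def select(relation, predicate):
    # Selection: keep the tuples of the relation that satisfy the predicate.
    return [t for t in relation if predicate(t)]


def union(*fragments):
    # Set union: a tuple that appears in several fragments is kept only once.
    seen, result = set(), []
    for fragment in fragments:
        for t in fragment:
            key = tuple(sorted(t.items()))
            if key not in seen:
                seen.add(key)
                result.append(t)
    return result


# One fragment per predicate Pi.
account1 = select(account, lambda t: t["branch_name"] == "Hillside")
account2 = select(account, lambda t: t["branch_name"] == "Valleyview")

# Reconstruction: the union of all fragments gives back the original relation.
reconstructed = union(account1, account2)
```

Reconstruction works here because every tuple satisfies at least one of the two predicates; a tuple that matched no predicate would be lost.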
Transparency
The user of a distributed database system should not be required to know where the data are physically located
nor how the data can be accessed at the specific local site. This characteristic, called data transparency, can take
several forms:
Fragmentation transparency. Users are not required to know how a relation has been fragmented.
Replication transparency. Users view each data object as logically unique. The distributed system may replicate an
object to increase either system performance or data availability. Users do not have to be concerned with what data
objects have been replicated, or where replicas have been placed.
Location transparency. Users are not required to know the physical location of the data. The distributed database
system should be able to find any data as long as the data identifier is supplied by the user transaction.
DISTRIBUTED TRANSACTIONS
There are two types of transaction that we need to consider.
• Local transactions are those that access and update data in only one local database;
• Global transactions are those that access and update data in several local databases.
System Structure
Each site has its own local transaction manager, whose function is to ensure the ACID properties of those
transactions that execute at that site. The various transaction managers cooperate to execute global transactions. To
understand how such a manager can be implemented, consider an abstract model of a transaction system, in which
each site contains two subsystems:
• The transaction manager manages the execution of those transactions (or subtransactions) that access data
stored in a local site.
• The transaction coordinator coordinates the execution of the various transactions (both local and global)
initiated at that site.
Each transaction manager is responsible for:
• Maintaining a log for recovery purposes.
• Participating in an appropriate concurrency-control scheme to coordinate the concurrent execution of the
transactions executing at that site.
The transaction coordinator subsystem is not needed in the centralized environment, since a transaction
accesses data at only a single site. A transaction coordinator, as its name implies, is responsible for coordinating the
execution of all the transactions initiated at that site. For each such transaction, the coordinator is responsible for:
• Starting the execution of the transaction.
• Breaking the transaction into a number of subtransactions and distributing these subtransactions to the
appropriate sites for execution.
• Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or
aborted at all sites.
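The coordinator's three responsibilities can be illustrated with a deliberately simplified Python sketch. The Site class, the function run_global_transaction, and the rule that a negative balance makes a subtransaction fail are all assumptions for this example; the sketch ignores logging, concurrency control, and the failure modes that a real commit protocol has to handle.

```python
class Site:
    """A local site with committed data and a tentative (uncommitted) copy."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)   # committed state
        self.pending = None      # tentative state of the running subtransaction

    def execute(self, updates):
        # Apply the updates tentatively; fail if any balance would go negative.
        tentative = dict(self.data)
        for key, delta in updates:
            tentative[key] = tentative.get(key, 0) + delta
        if any(value < 0 for value in tentative.values()):
            return False
        self.pending = tentative
        return True

    def commit(self):
        if self.pending is not None:
            self.data, self.pending = self.pending, None

    def abort(self):
        self.pending = None


def run_global_transaction(sites, operations):
    # Break the transaction into one subtransaction per site.
    subtransactions = {}
    for site_name, key, delta in operations:
        subtransactions.setdefault(site_name, []).append((key, delta))
    # Distribute the subtransactions and collect their outcomes.
    outcomes = [sites[name].execute(ups) for name, ups in subtransactions.items()]
    # Coordinate termination: commit at all sites or abort at all sites.
    if all(outcomes):
        for name in subtransactions:
            sites[name].commit()
        return "committed"
    for name in subtransactions:
        sites[name].abort()
    return "aborted"


sites = {"S1": Site("S1", {"A": 150}), "S2": Site("S2", {"B": 20})}
transfer = [("S1", "A", -100), ("S2", "B", +100)]
first = run_global_transaction(sites, transfer)    # succeeds at both sites
second = run_global_transaction(sites, transfer)   # S1 would go negative
```

In the second run the subtransaction at S2 succeeds locally, but it is still aborted because the one at S1 fails, so neither site changes.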
System Failure Modes
• Failure of a site.
• Loss of messages.
• Failure of a communication link.
• Network partition.
OBJECT-BASED DATABASES
An object-oriented database system is a database system that natively supports an object-oriented type system,
and allows direct access to data from an object-oriented programming language using the native type system of
the language.
Complex Data Types
Traditional database applications have conceptually simple datatypes. The basic data items are records that
are fairly small and whose fields are atomic.
In recent years, demand has grown for ways to deal with more complex data types. Consider, for example,
addresses. While an entire address could be viewed as an atomic data item of type string, this view would hide details
such as the street address, city, state, and postal code, which could be of interest to queries. On the other hand, if an
address were represented by breaking it into the components (street address, city, state, and postal code), writing
queries would be more complicated since they would have to mention each field. A better alternative is to allow
structured datatypes that allow a type address with subparts street address, city, state, and postal code.
Structured Types
Structured types allow composite attributes of E-R designs to be represented directly. For instance, a
composite attribute name with component attributes firstname and lastname can be represented by a structured
type whose two fields are firstname and lastname.
Suppose we wish to record information about books, including a set of keywords for each book. Suppose also
that we wish to store the names of the authors of a book as an array; unlike elements in a multiset, the elements of an
array are ordered, so we can distinguish the first author from the second author, and so on. The following example
illustrates how these array and multiset-valued attributes can be defined in SQL:
create type Publisher as
(name varchar(20),
branch varchar(20));
create type Book as
(title varchar(20),
author_array varchar(20) array [10],
pub_date date,
publisher Publisher,
keyword_set varchar(20) multiset);
create table books of Book;
The first statement defines a type called Publisher with two components: a name and a branch. The second
statement defines a structured type Book that contains a title, an author array, which is an array of up to 10 author
names, a publication date, a publisher (of type Publisher), and a multiset of keywords. Finally, a table books containing
tuples of type Book is created.
Object-Identity and Reference Types in SQL
Object-oriented languages provide the ability to refer to objects. An attribute of a type can be a reference to
an object of a specified type. For example, in SQL we can define a type Department with a field name and a field head
that is a reference to the type Person, and a table departments of type Department, as follows:
create type Department (
name varchar(20),
head ref(Person) scope people);
create table departments of Department;
Here, the reference is restricted to tuples of the table people. The restriction of the scope of a reference to
tuples of a table is mandatory in SQL, and it makes references behave like foreign keys.
Object-relational Features
Object-relational database systems are basically extensions of existing relational database systems. Changes
are clearly required at many levels of the database system. However, to minimize changes to the storage-system
code (relation storage, indices, etc.), the complex datatypes supported by object-relational systems can be translated
to the simpler type system of relational databases.
Subtables can be stored in an efficient manner, without replication of all inherited fields, in one of two
ways:
• Each table stores the primary key (which may be inherited from a parent table) and the attributes that are
defined locally. Inherited attributes (other than the primary key) do not need to be stored, and can be derived
by means of a join with the supertable, based on the primary key.
• Each table stores all inherited and locally defined attributes. When a tuple is inserted, it is stored only in the
table in which it is inserted, and its presence is inferred in each of the supertables. Access to all attributes of
a tuple is faster, since a join is not required.
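The first storage option can be tried out with Python's built-in sqlite3 module. This is only a sketch under assumptions: the table names people and students, their columns, and the sample rows are invented, and SQLite has no notion of subtables, so the inheritance is emulated with two ordinary tables that share a primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Supertable: holds the attributes common to all people.
    CREATE TABLE people (name TEXT PRIMARY KEY, address TEXT);
    -- Subtable: holds only the primary key and the locally defined attribute.
    CREATE TABLE students (name TEXT PRIMARY KEY REFERENCES people(name), degree TEXT);
    INSERT INTO people VALUES ('Ann', '12 Main St'), ('Bob', '9 Hill Rd');
    INSERT INTO students VALUES ('Ann', 'MSc');
""")

# The inherited attribute (address) is derived by joining with the supertable.
rows = conn.execute("""
    SELECT s.name, p.address, s.degree
    FROM students AS s JOIN people AS p ON p.name = s.name
""").fetchall()
print(rows)
```

Under the second option the students table would also carry the address column, which removes the join at the cost of storing inherited attributes in every subtable.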
An object has five aspects: identifier, name, lifetime, structure, and creation.
1. The object identifier is a unique system-wide identifier (or Object_id). Every object must have an object
identifier.
2. Some objects may optionally be given a unique name within a particular ODMS—this name can be used to
locate the object, and the system should return the object given that name. Obviously, not all individual objects
will have unique names. Typically, a few objects, mainly those that hold collections of objects of a particular
object type—such as extents—will have a name. These names are used as entry points to the database; that
is, by locating these objects by their unique name, the user can then locate other objects that are referenced
from these objects. Other important objects in the application may also have unique names, and it is possible
to give more than one name to an object. All names within a particular ODMS must be unique.
3. The lifetime of an object specifies whether it is a persistent object (that is, a database object) or transient
object (that is, an object in an executing program that disappears after the program terminates). Lifetimes are
independent of types; that is, some objects of a particular type may be transient whereas others may be
persistent.
4. The structure of an object specifies how the object is constructed by using the type constructors. The structure
specifies whether an object is atomic or not. An atomic object refers to a single object that follows a
user-defined type, such as Employee or Department. If an object is not atomic, then it will be composed of other
objects. For example, a collection object is not an atomic object, since its state will be a collection of other
objects. In the ODMG model, an atomic object is any individual user-defined object. All values of the basic
built-in data types are considered to be literals.
5. Object creation refers to the manner in which an object can be created. This is typically accomplished via
an operation new for a special Object_Factory interface.
In the object model, a literal is a value that does not have an object identifier. However, the value may have
a simple or complex structure.
There are three types of literals: atomic, structured, and collection.
1. Atomic literals correspond to the values of basic data types and are predefined. The basic data types of the
object model include long, short, and unsigned integer numbers (these are specified by the keywords
long, short, unsigned long, and unsigned short in ODL), regular and double precision floating point numbers
(float, double), Boolean values (boolean), single characters (char), character strings (string), and enumeration
types (enum), among others.
2. Structured literals correspond roughly to values that are constructed using the tuple constructor. The built-in
structured literals include Date, Interval, Time, and Timestamp.
3. Collection literals specify a literal value that is a collection of objects or values but the collection itself does
not have an Object_id. The collections in the object model can be defined by the type generators
set<T>, bag<T>, list<T>, and array<T>, where T is the type of objects or values in the collection. Another
collection type is dictionary<K, V>, which is a collection of associations <K, V>, where each K is a key (a
unique search value) associated with a value V; this can be used to create an index on a collection of values
V.
The notation of ODMG uses three concepts: interface, literal, and class. Following the ODMG terminology,
we use the word behavior to refer to operations and state to refer to properties (attributes and relationships).
An interface specifies only behavior of an object type and is typically noninstantiable (that is, no objects are
created corresponding to an interface). Although an interface may have state properties (attributes and relationships)
as part of its specifications, these cannot be inherited from the interface. Hence, an interface serves to define operations
that can be inherited by other interfaces, as well as by classes that define the user-defined objects for a particular
application.
A class specifies both state (attributes) and behavior (operations) of an object type, and is instantiable. Hence,
database and application objects are typically created based on the user-specified class declarations that form a
database schema.
Finally, a literal declaration specifies state but no behavior. Thus, a literal instance holds a simple or
complex structured value but has neither an object identifier nor encapsulated operations.
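The distinction between interface, class, and literal can be loosely mirrored in Python. This is an analogy, not ODMG syntax: the names Describable, Employee, and Date are chosen for the example, and the counter that hands out object identifiers is an assumption standing in for whatever the ODMS would do.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import itertools

_next_oid = itertools.count(1)   # hypothetical source of object identifiers


class Describable(ABC):
    """Interface: specifies behavior only and cannot be instantiated."""

    @abstractmethod
    def describe(self) -> str:
        ...


@dataclass(frozen=True)
class Date:
    """Literal: a structured value with no object identifier and no operations."""
    year: int
    month: int
    day: int


class Employee(Describable):
    """Class: state plus behavior; every instance gets its own identifier."""

    def __init__(self, name, hired):
        self.oid = next(_next_oid)
        self.name = name
        self.hired = hired

    def describe(self):
        return f"{self.name} (oid {self.oid})"


e1 = Employee("Ann", Date(2020, 1, 15))
e2 = Employee("Ann", Date(2020, 1, 15))
```

The two employees hold equal state but remain distinct objects with different identifiers, whereas the two Date values are interchangeable because a literal is identified only by its value.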
ODL: OBJECT DEFINITION LANGUAGE
Object Definition Language (ODL) is the specification language defining the interface to object types
conforming to the ODMG Object Model. Its purpose is to define the structure of an entity-relationship diagram.
Class Declarations
• interface < name > {elements = attributes, relationships, methods }
Element Declarations
• attribute < type > < name > ;
• relationship < rangetype > < name > ;
Method Example
• float gpa(in Student) raises(noGrades)
o float is the return type.
o in indicates that the Student argument is read-only. Other options: out, inout.
Relationships
• use inverse to specify inverse relationships
• 'at most one' semantics remain
• multiplicity
o if many-many between C and D, then use Set<D> and Set<C>, respectively
o if many-one from C to D, then use D in C and Set<C> in D
o if many-one from D to C, then use C in D and Set<D> in C
o if one-one between C and D, then use D and C, respectively
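The many-one rule and the role of inverse can be sketched in Python. The classes Employee and Department and the relationship names works_for and has_employees are hypothetical; in ODL the system is told about the inverse, whereas here the method set_works_for keeps both directions in step by hand.

```python
class Department:
    def __init__(self, name):
        self.name = name
        self.has_employees = set()   # plays the role of Set<Employee>


class Employee:
    def __init__(self, name):
        self.name = name
        self.works_for = None        # a single Department: 'at most one'

    def set_works_for(self, dept):
        # Update both directions together so the inverse stays consistent.
        if self.works_for is not None:
            self.works_for.has_employees.discard(self)
        self.works_for = dept
        if dept is not None:
            dept.has_employees.add(self)


research = Department("Research")
sales = Department("Sales")
ann = Employee("Ann")

ann.set_works_for(research)
ann.set_works_for(sales)   # moving Ann removes her from Research automatically
```

This is the many-one case from Employee to Department: Employee holds a single Department, and Department holds a set of Employee objects.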
Datatypes
• basis
o atomic: integer, float, character, character string, boolean, and enumeration
o classes
• type constructors (can be composed to create complex types)
o set: Set<T>
o bag: Bag<T>
o list: List<T> (sequential access)
o array: Array<T,i> (random access)
o dictionary: Dictionary<T,S>
o structures
• difference between sets, bags, and lists
• rules for types and relationships
o 'type of a relationship is either a class type or a (single use of a) collection type constructor applied
to a class type' [FCDB]
o 'type of an attribute is built starting with atomic type(s)' [FCDB]
• relationship types cannot involve
o atomic types (e.g., Set<integer>),
o structures (e.g., Struct N {Movie field1, Star field2}), or
o two applications of collection types (e.g., Set<Array<Star, 10>>)
Similarities between E/R and ODL
• both support all multiplicities of relationships
• both support inheritance
Differences between E/R and ODL
Banking Example 1
slash.</…..>.
Complex elements are constructed from other elements hierarchically, whereas simple elements contain data
values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of
the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to
process the data elements in the XML document automatically by computer programs. Also, the XML tag (element)
names can be defined in another document, known as the schema document, to give a semantic meaning to the tag
names that can be exchanged among multiple users. In HTML, all tag names are predefined and fixed, which is why
the set of HTML tags is not extensible.
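Because XML tag names describe the data rather than its display, a program can process a document directly. The following Python sketch uses the standard xml.etree.ElementTree module; the document content and the helper classify are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small made-up document: tag names say what the data means.
doc = """
<project>
  <name>ProductX</name>
  <number>1</number>
  <worker>
    <ssn>123456789</ssn>
    <hours>32.5</hours>
  </worker>
</project>
"""

root = ET.fromstring(doc)


def classify(element):
    # Complex elements contain other elements; simple elements hold a data value.
    return "complex" if len(element) > 0 else "simple"


kinds = {element.tag: classify(element) for element in root.iter()}

# The data value of a simple element can be read by following the tag names.
hours = float(root.find("worker/hours").text)
```

Here project and worker come out as complex elements, while name, number, ssn, and hours are simple elements.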
It is possible to characterize three main types of XML documents:
• Data-centric XML documents. These documents have many small data items that follow a specific structure
and hence may be extracted from a structured database. They are formatted as XML documents in order to
exchange them over or display them on the Web. These usually follow a predefined schema that defines the
tag names.
• Document-centric XML documents. These are documents with large amounts of text, such as news articles
or books. There are few or no structured data elements in these documents.
• Hybrid XML documents. These documents may have parts that contain structured data and other parts that
are predominantly textual or unstructured. They may or may not have a predefined schema.
XML documents that do not follow a predefined schema of element names and corresponding tree structure
are known as schemaless XML documents. It is important to note that data-centric XML documents can be considered
either as semistructured data or as structured data.
DOCUMENT TYPE DEFINITION (DTD)
The document type definition (DTD) is an optional part of an XML document. The main purpose of a DTD is
much like that of a schema: to constrain and type the information present in the document.
However, the DTD does not in fact constrain types in the sense of basic types like integer or string. Instead,
it constrains only the appearance of subelements and attributes within an element. The DTD is primarily a list of rules
for what pattern of subelements may appear within an element.
Example of a DTD
Thus, in the DTD, a university element consists of one or more course, department, or instructor elements; the
| operator specifies "or" while the + operator specifies "one or more." Although not shown here, the * operator is used
to specify "zero or more," while the ? operator is used to specify an optional element (that is, "zero or one"). The
course element contains subelements course_id, title, dept_name, and credits (in that order).
Similarly, department and instructor have the attributes of their relational schema defined as subelements in
the DTD. Finally, the elements course_id, title, dept_name, credits, building, budget, IID, name, and salary are all
declared to be of type #PCDATA. The keyword #PCDATA indicates text data; it derives its name, historically, from
"parsed character data." Two other special type declarations are empty, which says that the element has no contents,
and any, which says that there is no constraint on the subelements of the element; that is, any elements, even those
not mentioned in the DTD, can occur as subelements of the element. The absence of a declaration for an element is
equivalent to explicitly declaring the type as any.
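The DTD operators behave much like regular-expression operators over the sequence of child tag names. The Python sketch below exploits that similarity; it is not a DTD validator, and the dictionary CONTENT_MODELS, the function children_match, and the element names written with underscores are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over child tag names.
# Each child tag is followed by one space so that adjacent names cannot run together.
CONTENT_MODELS = {
    "university": r"((course|department|instructor) )+",   # one or more, any mix
    "course": r"course_id title dept_name credits ",        # fixed order
}


def children_match(element):
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True   # no declaration: treated like "any"
    child_tags = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, child_tags) is not None


good = ET.fromstring(
    "<university><course><course_id>CS-101</course_id><title>Intro</title>"
    "<dept_name>CS</dept_name><credits>4</credits></course><department/></university>"
)
bad_course = ET.fromstring(
    "<course><course_id>CS-101</course_id><credits>4</credits>"
    "<title>Intro</title><dept_name>CS</dept_name></course>"
)
empty = ET.fromstring("<university/>")

results = (
    children_match(good),
    children_match(good[0]),
    children_match(bad_course),
    children_match(empty),
)
```

The misordered course fails because its subelements do not appear in the declared order, and the empty university fails because + requires at least one subelement.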
XML SCHEMA
XML Schema defines a number of built-in types such as string, integer, decimal, date, and boolean. In addition,
it allows user-defined types; these may be simple types with added restrictions, or complex types constructed using
constructors such as complexType and sequence.
Note that any namespace prefix could be used in place of xs; thus we could replace all occurrences of "xs:"
in the schema definition with "xsd:" without changing the meaning of the schema definition. All types defined by
XML Schema must be prefixed by this namespace prefix. The first element is the root element university, whose type
is specified to be UniversityType, which is declared later. The example then defines the types of elements department,
course, instructor, and teaches. Note that each of these is specified by an element with tag xs:element, whose body
contains the type definition.
The type of department is defined to be a complex type, which is further specified to consist of a sequence of elements
dept_name, building, and budget. Any type that has either attributes or nested subelements must be
specified to be a complex type. Alternatively, the type of an element can be specified to be a predefined type by the
attribute type; observe how the XML Schema types xs:string and xs:decimal are used to constrain the types of data
elements such as dept_name and credits. Finally, the example defines the type UniversityType as containing zero or
more occurrences of each of department, course, instructor, and teaches. Note the use of ref to specify the occurrence
of an element that is declared elsewhere in the schema.
Attributes are specified using the xs:attribute tag. For example, we could have defined dept_name as an attribute
by adding:
<xs:attribute name = "dept_name"/>
within the declaration of the department element. Adding the attribute use = "required" to the above attribute
specification declares that the attribute must be specified, whereas the default value of use is optional. Attribute
specifications would appear directly under the enclosing complexType specification, even if elements are nested
within a sequence specification.
In addition to defining types, a relational schema also allows the specification of constraints. XML Schema
allows the specification of keys and key references, corresponding to the primary-key and foreign-key definition in
SQL. In SQL, a primary-key constraint or unique constraint ensures that the attribute values do not recur within the
relation. In the context of XML, we need to specify a scope within which values are unique and form a key. The
selector is a path expression that defines the scope for the constraint, and field declarations specify the elements or
attributes that form the key. To specify that dept_name forms a key for department elements under the root university
element, we add to the schema definition a key constraint whose selector picks out the department elements under
university and whose field is dept_name.
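As a rough illustration of how a selector and a field work together, the following Python sketch checks uniqueness of a key within a scope. It is not an XML Schema validator: the function key_violations and the sample document are assumptions, and only the uniqueness part of a key constraint is checked.

```python
import xml.etree.ElementTree as ET


def key_violations(root, selector, field):
    # selector: path expression that picks the elements in scope
    # field: name of the subelement whose text forms the key
    seen, duplicates = set(), []
    for element in root.findall(selector):
        value = element.findtext(field)
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return duplicates


university = ET.fromstring(
    "<university>"
    "<department><dept_name>Physics</dept_name><building>Watson</building></department>"
    "<department><dept_name>Music</dept_name><building>Packard</building></department>"
    "<department><dept_name>Physics</dept_name><building>Taylor</building></department>"
    "</university>"
)

# Scope: department elements directly under the root; key: dept_name.
duplicates = key_violations(university, "department", "dept_name")
```

The second Physics department violates the key, so the function reports that value once.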
XML Schema offers several benefits over DTDs, and is widely used today. Among the benefits that we have seen in
the examples above are these:
• It allows the text that appears in elements to be constrained to specific types, such as numeric types in
specific formats or complex types such as sequences of elements of other types.
• It allows user-defined types to be created.
• It allows uniqueness and foreign-key constraints.
• It is integrated with namespaces to allow different parts of a document to conform to different schemas.
In addition to the features we have seen, XML Schema supports several other features that DTDs do not, such as
these:
• It allows types to be restricted to create specialized types, for instance by specifying minimum and
maximum values.
• It allows complex types to be extended by using a form of inheritance.
XQUERY
XPath allows us to write expressions that select items from a tree-structured XML document. XQuery permits
the specification of more general queries on one or more XML documents. The typical form of a query in XQuery is
known as a FLWR expression, which stands for the four main clauses of XQuery and has the following form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
There can be zero or more instances of the FOR clause, as well as of the LET clause in a single XQuery. The
WHERE clause is optional, but can appear at most once, and the RETURN clause must appear exactly once. Let us
1. Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are variables.
2. The LET clause assigns a variable to a particular expression for the rest of the query. In this example, $d is
assigned to the document file name. It is possible to have a query that refers to multiple documents by
assigning multiple variables in this way.
3. The FOR clause assigns a variable to range over each of the individual items in a sequence. In our example,
the sequences are specified by path expressions. The $x variable ranges over elements that satisfy the path
expression $d/company/project[projectNumber = 5]/projectWorker. The $y variable ranges over elements that
satisfy the path expression $d/company/employee. Hence, $x ranges over projectWorker elements, whereas
$y ranges over employee elements.
4. The WHERE clause specifies additional conditions on the selection of items. In this example, the first
condition selects only those projectWorker elements that satisfy the condition (hours gt 20.0). The second
condition specifies a join condition that combines an employee with a projectWorker only if they have the
same ssn value.
5. Finally, the RETURN clause specifies which elements or attributes should be retrieved from the items that
satisfy the query conditions. In this example, it will return a sequence of elements, one for each employee
who works more than 20 hours per week on project number 5.
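The clause-by-clause behavior described above can be imitated in Python with a list comprehension over an in-memory document. This is an analogy rather than XQuery: the sample data, the lastName element, and the shape of the returned tuples are assumptions, and because ElementTree starts at the root element the paths are written relative to company.

```python
import xml.etree.ElementTree as ET

# Made-up company data shaped to fit the path expressions in the text.
xml_text = """
<company>
  <employee><ssn>111</ssn><lastName>Smith</lastName></employee>
  <employee><ssn>222</ssn><lastName>Wong</lastName></employee>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><ssn>111</ssn><hours>32.5</hours></projectWorker>
    <projectWorker><ssn>222</ssn><hours>10.0</hours></projectWorker>
  </project>
  <project>
    <projectNumber>7</projectNumber>
    <projectWorker><ssn>222</ssn><hours>40.0</hours></projectWorker>
  </project>
</company>
"""

# LET: bind d to the document (here, its root element company).
d = ET.fromstring(xml_text)

result = [
    (y.findtext("lastName"), float(x.findtext("hours")))              # RETURN
    for x in d.findall("project[projectNumber='5']/projectWorker")    # FOR $x
    for y in d.findall("employee")                                    # FOR $y
    if float(x.findtext("hours")) > 20.0                              # WHERE: hours gt 20.0
    and x.findtext("ssn") == y.findtext("ssn")                        # WHERE: join on ssn
]
```

Only the worker with ssn 111 qualifies: the other worker on project 5 has too few hours, and the 40-hour assignment belongs to project 7.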
XQuery has very powerful constructs to specify complex queries. In particular, it can specify universal and
existential quantifiers in the conditions of a query, aggregate functions, ordering of query results, selection based on
position in a sequence, and even conditional branching. Hence, in some ways, it qualifies as a full-fledged
programming language.