
DMSCSE22

ANNAMALAI UNIVERSITY
DIRECTORATE OF DISTANCE EDUCATION

M.Sc COMPUTER SCIENCE


SECOND SEMESTER

ADVANCED RDBMS

Copyright Reserved
(For Private Circulation Only)

UNIT - I

Topics:
 Concepts for Object – Oriented Databases
 Object identity, Object Structure and Type Constructors
 ODMG (Object Data Management Group)
 Object Definition Language (ODL)
 Object Query Language (OQL)
 Overview of C++ Language Binding
 Object Database Conceptual Design
 Overview of the CORBA Standard for Distributed Objects
 Object Relational and Extended Relational Database
Systems:
 The Informix Universal Server
 Object Relational features of Oracle 8
 An Overview of SQL
 Implementation & related Issues for Extended Type Systems
 The Nested Relational Data Model

1.0 Introduction

Information is represented in an object-oriented database in the form of objects, as used in Object-Oriented Programming. When database capabilities are combined with object programming language capabilities, the result is an object database management system (ODBMS). An ODBMS makes database objects appear as programming language objects in one or more object programming languages. An ODBMS supports the programming language with transparently persistent data, concurrency control, data recovery, associative queries, and other capabilities.

Object database management systems grew out of research during the early to mid-1980s
into having intrinsic database management support for graph-structured objects. The term
"object-oriented database system" first appeared around 1985.

Object database management systems added the concept of persistence to object


programming languages. The early commercial products were integrated with various
languages: GemStone (Smalltalk), Gbase (Lisp), and Vbase (COP). COP was the C
Object Processor, a proprietary language based on C that pre-dated C++. For much of the
1990s, C++ dominated the commercial object database management market. Vendors
added Java in the late 1990s and more recently, C#.

Starting in 2004, object databases have seen a second growth period when open source
object databases emerged that were widely affordable and easy to use, because they are
entirely written in OOP languages like Java or C#, such as db4objects and Perst
(McObject).

Benchmarks between ODBMSs and relational DBMSs have shown that ODBMS can be
clearly superior for certain kinds of tasks. The main reason for this is that many


operations are performed using navigational rather than declarative interfaces, and
navigational access to data is usually implemented very efficiently by following pointers.

Critics of Navigational Database-based technologies, like ODBMS, suggest that pointer-


based techniques are optimized for very specific "search routes" or viewpoints. However,
for general-purpose queries on the same information, pointer-based techniques will tend
to be slower and more difficult to formulate than relational. Thus, navigational appears to
simplify specific known uses at the expense of general, unforeseen, and varied future
uses.

Other things that work against ODBMS seem to be the lack of interoperability with a
great number of tools/features that are taken for granted in the SQL world including but
not limited to industry standard connectivity, reporting tools, OLAP tools and backup and
recovery standards. Additionally, object databases lack a formal mathematical
foundation, unlike the relational model, and this in turn leads to weaknesses in their query
support. However, this objection is offset by the fact that some ODBMSs fully support
SQL in addition to navigational access, e.g. Objectivity/SQL++ and Matisse. Effective
use may require compromises to keep both paradigms in sync.

In fact there is an intrinsic tension between the notion of encapsulation, which hides data
and makes it available only through a published set of interface methods, and the
assumption underlying much database technology, which is that data should be accessible
to queries based on data content rather than predefined access paths. Database-centric
thinking tends to view the world through a declarative and attribute-driven viewpoint,
while OOP tends to view the world through a behavioral viewpoint. This is one of the
many impedance mismatch issues surrounding OOP and databases.

Although some commentators have written off object database technology as a failure,
the essential arguments in its favor remain valid, and attempts to integrate database
functionality more closely into object programming languages continue in both the
research and the industrial communities.

1.1 Objectives
The objective of this lesson is to learn the Object-Oriented database concepts with respect to Object Identity, Object Structure, Object Database Standards, Language and Design, and an Overview of CORBA.

1.2 Content

1.2.1 Concepts for Object-Oriented Databases

A database is a logical term used to refer to a collection of organized and related information. In any business, pieces of information about customers, products, prices, and so on constitute a database. Data is just data until it is organized in a meaningful way, at which point it becomes information.


Through a Database Management System one can insert, update, delete, and view the records in an existing file.

 Traditional Data Models : Hierarchical, Network (since mid-60’s),


Relational (since 1970 and commercially since 1982).
 Object Oriented (OO) Data Models since mid-90’s.
 Reasons for creation of Object Oriented Databases

– Need for more complex applications


– Need for additional data modeling features
– Increased use of object-oriented programming languages

 Commercial OO Database products – several in the 1990’s, but did not


make much impact on mainstream data management
 Languages: Simula (1960’s), Smalltalk (1970’s), C++ (late 1980’s), Java
(1990’s)

 Experimental Systems: Orion at MCC, IRIS at H-P labs, Open-OODB at


T.I., ODE at ATT Bell labs, Postgres - Montage - Illustra at UC/B,
Encore/Observer at Brown.
 Commercial OO Database products: Ontos, Gemstone, O2 ( -> Ardent),
Objectivity, Objectstore ( -> Excelon), Versant, Poet, Jasmine (Fujitsu – GM).

1.2.2 Overview of Object Oriented Concepts.

 MAIN CLAIM: OO databases try to maintain a direct correspondence between


real-world and database objects so that objects do not lose their integrity and
identity and can easily be identified and operated upon
 Object: Two components: state (value) and behavior (operations). Similar to
program variable in programming language, except that it will typically have a
complex data structure as well as specific operations defined by the programmer
 In OO databases, objects may have an object structure of arbitrary complexity in
order to contain all of the necessary information that describes the object.
 In contrast, in traditional database systems, information about a complex object is
often scattered over many relations or records, leading to loss of direct

correspondence between a real-world object and its database representation.
 The internal structure of an object in OOPLs includes the specification of instance
variables, which hold the values that define the internal state of the object.
 An instance variable is similar to the concept of an attribute, except that instance
variables may be encapsulated within the object and thus are not necessarily
visible to external users
 Some OO models insist that all operations a user can apply to an object must be
predefined. This forces a complete encapsulation of objects.
 To encourage encapsulation, an operation is defined in two parts:
– signature or interface of the operation, specifies the operation name and
arguments (or parameters).


– method or body, specifies the implementation of the operation.


 Operations can be invoked by passing a message to an object, which includes the
operation name and the parameters. The object then executes the method for that
operation.
 This encapsulation permits modification of the internal structure of an object, as
well as the implementation of its operations, without the need to disturb the
external programs that invoke these operations
 Some OO systems provide capabilities for dealing with multiple versions of the
same object (a feature that is essential in design and engineering applications).
 For example, an old version of an object that represents a tested and verified
design should be retained until the new version is tested and verified: it is very
crucial for designs in manufacturing process control, architecture, and software systems.
 Operator polymorphism: It refers to an operation’s ability to be applied to
different types of objects; in such a situation, an operation name may refer to
several distinct implementations, depending on the type of objects it is applied to.
 This feature is also called operator overloading

1.2.3 Object identity, Object Structure and Type constructors

 Unique Identity: An OO database system provides a unique identity to each


independent object stored in the database. This unique identity is typically
implemented via a unique, system-generated object identifier, or OID
 The main property required of an OID is that it be immutable; that is, the OID value
of a particular object should not change. This preserves the identity of the real-world
object being represented
 Type Constructors: In OO databases, the state (current value) of a complex object
may be constructed from other objects (or other values) by using certain type
constructors.
-The three most basic constructors are atom, tuple, and set. Other commonly used
constructors include list, bag, and array. The atom constructor is used to represent all
basic atomic values, such as integers, real numbers, character strings, Booleans, and
any other basic data types that the system supports directly.

Example 1: one possible relational database state corresponding to the COMPANY schema.

We use i1, i2, i3, . . . to stand for unique system-generated object identifiers. Consider the
following objects:
o1 = (i1, atom, ‘Houston’)
o2 = (i2, atom, ‘Bellaire’)
o3=(i3,atom,‘Sugarland’)
o4 = (i4, atom, 5)
o5 = (i5, atom, ‘Research’)
o6 = (i6, atom, ‘1988-05-22’)


o7 = (i7, set, {i1, i2, i3})


o8 = (i8, tuple,<dname:i5, dnumber:i4, mgr:i9, locations:i7, employees:i10,
projects:i11>)
o9 = (i9, tuple, <manager:i12, manager_start_date:i6>)
o10 = (i10, set, {i12, i13, i14})
o11 = (i11, set {i15, i16, i17})
o12 = (i12, tuple, <fname:i18, minit:i19, lname:i20, ssn:i21, . . ., salary:i26,
supervisor:i27, dept:i8>)
The first six objects listed in this example represent atomic values. Object seven is a set-
valued object that represents the set of locations for department 5; the set refers to the
atomic objects with values {‘Houston’, ‘Bellaire’, ‘Sugarland’}. Object 8 is a tuple-
valued object that represents department 5 itself, and has the attributes DNAME,
DNUMBER, MGR, LOCATIONS, and so on.

This example illustrates the difference between the two definitions for comparing
object states for equality.
o1 = (i1, tuple, <a1:i4, a2:i6>)
o2 = (i2, tuple, <a1:i5, a2:i6>)
o3 = (i3, tuple, <a1:i4, a2:i6>)
o4 = (i4, atom, 10)
o5 = (i5, atom, 10)
o6 = (i6, atom, 20)
In this example, objects o1 and o2 have equal states, since their states at the atomic level are the same even though the values are reached through distinct objects o4 and o5. The states of objects o1 and o3 are identical, even though the objects themselves are not identical because they have distinct OIDs. Similarly, although the states of o4 and o5 are identical, the objects o4 and o5 are equal but not identical, because they have distinct OIDs.
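
The difference between identity (same OID) and equality (same state) can be illustrated with a small Java sketch. This is an in-memory model written only for this lesson; the Obj class and its methods are hypothetical and do not come from any particular OODBMS.

import java.util.*;

// Minimal model of an object with a system-generated OID and a state that is
// either an atomic value or a tuple (a map from attribute names to objects).
final class Obj {
    final int oid;
    final Object state;

    Obj(int oid, Object state) { this.oid = oid; this.state = state; }

    // Identical: the very same object, i.e. the same OID.
    boolean identicalTo(Obj other) { return this.oid == other.oid; }

    // Equal: the states have the same values when compared recursively.
    boolean equalStateTo(Obj other) {
        if (state instanceof Map && other.state instanceof Map) {
            Map<?, ?> a = (Map<?, ?>) state, b = (Map<?, ?>) other.state;
            if (!a.keySet().equals(b.keySet())) return false;
            for (Object k : a.keySet())
                if (!((Obj) a.get(k)).equalStateTo((Obj) b.get(k))) return false;
            return true;
        }
        return Objects.equals(state, other.state);   // atom comparison
    }

    public static void main(String[] args) {
        Obj o4 = new Obj(4, 10), o5 = new Obj(5, 10), o6 = new Obj(6, 20);
        Obj o1 = new Obj(1, Map.of("a1", o4, "a2", o6));
        Obj o2 = new Obj(2, Map.of("a1", o5, "a2", o6));
        Obj o3 = new Obj(3, Map.of("a1", o4, "a2", o6));
        System.out.println(o1.equalStateTo(o2));  // true: same values via distinct atoms
        System.out.println(o1.identicalTo(o3));   // false: distinct OIDs
        System.out.println(o4.equalStateTo(o5));  // true
        System.out.println(o4.identicalTo(o5));   // false
    }
}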

1.2.4 Encapsulation of Operations, Methods and Persistence Encapsulation


 One of the main characteristics of OO languages and systems
 Related to the concepts of abstract data types and information hiding in
programming languages
 Specifying Object Behavior via Class Operations:
 The main idea is to define the behavior of a type of object based on the operations
that can be externally applied to objects of that type.
 In general, the implementation of an operation can be specified in a general-
purpose programming language that provides flexibility and power in defining
the operations.
 For database applications, the requirement that all objects be completely
encapsulated is too stringent.
 One way of relaxing this requirement is to divide the structure of an object into
visible and hidden attributes (instance variables).
 Adding operations to definitions of Employee and Department
 Specifying Object Persistence via Naming and Reachability:


 Naming Mechanism: Assign an object a unique persistent name through which it


can be retrieved by this and other programs.
 Reachability Mechanism: Make the object reachable from some persistent object.
 An object B is said to be reachable from an object A if a sequence of references in
the object graph lead from object A to object B.
 In traditional database models such as relational model or EER model, all objects
are assumed to be persistent.
 In OO approach, a class declaration specifies only the type and operations for a
class of objects. The user must separately define a persistent object of type set
(DepartmentSet) or list (DepartmentList) whose value is the collection of
references to all persistent DEPARTMENT objects

Creating Persistent objects by naming and reachability


Define class DepartmentSet:
Type set(Department);
Operations add_dept(d:Department): Boolean;
(* adds a department to the DepartmentSet object *)
remove_dept(d:Department): Boolean;
(* this will remove a department from the DepartmentSet Object *)
create_dept_set: DepartmentSet;
destroy_dept_set: Boolean;
end DepartmentSet;
…….
persistent name AllDepartments: DepartmentSet;
(* AllDepartments is a persistent named object of type DepartmentSet *)
…..

1.2.5 Type Hierarchies and Inheritance

Type (class) Hierarchy


A type in its simplest form can be defined by giving it a type name and then listing the
names of its visible (public) functions
When specifying a type in this section, we use the following format, which does not
specify arguments of functions, to simplify the discussion:

 TYPE_NAME: function, function, . . . , function

Example: PERSON: Name, Address, Birthdate, Age, SSN


Subtype: defined when the designer or user must create a new type that is similar but not identical to an already defined type.
Supertype: the already defined type; the subtype inherits all of the functions of the supertype.

Example (1):
EMPLOYEE: Name, Address, Birthdate, Age, SSN, Salary, HireDate, Seniority
STUDENT: Name, Address, Birthdate, Age, SSN, Major, GPA


OR:
EMPLOYEE subtype-of PERSON: Salary, HireDate, Seniority
STUDENT subtype-of PERSON: Major, GPA

Example (2): Consider a type that describes objects in plane geometry, which may be
defined as follows:

GEOMETRY_OBJECT: Shape, Area, ReferencePoint


Now suppose that we want to define a number of subtypes for the
GEOMETRY_OBJECT type, as follows:

RECTANGLE subtype-of GEOMETRY_OBJECT: Width, Height


TRIANGLE subtype-of GEOMETRY_OBJECT: Side1, Side2, Angle
CIRCLE subtype-of GEOMETRY_OBJECT: Radius

An alternative way of declaring these three subtypes is to specify the value of the Shape
attribute as a condition that must be satisfied for objects of each subtype:

RECTANGLE subtype-of GEOMETRY_OBJECT (Shape=‘rectangle’): Width, Height


TRIANGLE subtype-of GEOMETRY_OBJECT (Shape=‘triangle’): Side1, Side2, Angle
CIRCLE subtype-of GEOMETRY_OBJECT (Shape=‘circle’): Radius
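
The GEOMETRY_OBJECT hierarchy can be written directly as a Java class hierarchy. The sketch below is illustrative only (method names and bodies are chosen for this lesson, and ReferencePoint is omitted for brevity):

abstract class GeometryObject {
    abstract String shape();   // corresponds to the Shape function
    abstract double area();    // corresponds to the Area function
}

class Rectangle extends GeometryObject {
    double width, height;
    Rectangle(double w, double h) { width = w; height = h; }
    String shape() { return "rectangle"; }
    double area()  { return width * height; }
}

class Circle extends GeometryObject {
    double radius;
    Circle(double r) { radius = r; }
    String shape() { return "circle"; }
    double area()  { return Math.PI * radius * radius; }
}

Each subtype inherits the supertype's functions (Shape, Area) and adds its own attributes (Width and Height, or Radius); invoking area() through a GeometryObject reference also illustrates the operator polymorphism described in section 1.2.2.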
 Extents: In most OO databases, the collection of objects in an extent has the same
type or class. However, since the majority of OO databases support types, we
assume that extents are collections of objects of the same type for the remainder
of this section.
 Persistent Collection: It holds a collection of objects that is stored permanently in
the database and hence can be accessed and shared by multiple programs
 Transient Collection: It exists temporarily during the execution of a program but
is not kept when the program terminates

1.2.6 Complex Objects

 Unstructured complex object: It is provided by a DBMS and permits the storage and
retrieval of large objects that are needed by the database application.
 Typical examples of such objects are bitmap images and long text strings

(documents); they are also known as binary large objects, or BLOBs for short.
 This has been the standard way by which Relational DBMSs have dealt with
supporting complex objects, leaving the operations on those objects outside the
RDBMS
 Structured complex object: It differs from an unstructured complex object in that the
object’s structure is defined by repeated application of the type constructors provided
by the OODBMS. Hence, the object structure is defined and known to the OODBMS.
The OODBMS also defines methods or operations on it.
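
The contrast between the two kinds of complex objects can be sketched in Java. The classes below are hypothetical and only intended to show what the DBMS can and cannot interpret:

// Unstructured complex object: the DBMS stores the bytes (a BLOB) but
// cannot interpret or query what is inside them.
class ScannedDocument {
    String title;
    byte[] imageData;          // bitmap image stored as-is
}

// Structured complex object: built by repeated application of type
// constructors (atoms, a tuple, and a list of tuples), so an OODBMS knows
// the structure and can define operations on it.
class Chapter {
    int number;                // atom
    String heading;            // atom
    String body;               // atom (long text)
}

class Book {
    String title;                                                    // atom
    java.util.List<Chapter> chapters = new java.util.ArrayList<>();  // list constructor
}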

1.2.7 Other Object-Oriented Concepts


Object Databases Standards


Why is a standard needed? A standard for an object model addresses the following aspects:

 Portability: execute an application program on different systems with minimal


modifications to the program.
 Interoperability

ODMG standard refers to - object model, object definition language (ODL), object
query language (OQL), and bindings to object-oriented programming languages.

The object model defines the data model upon which ODL and OQL are based; it provides data types and type constructors, just as the SQL report describes a standard data model for relational databases.
The relation between an object and a literal is that a literal has only a value but no object identifier, whereas an object has four characteristics:
•identifier
•name
•lifetime (persistent or not)
•structure (how it is constructed)

Object Database Language

a. Object Definition Language (ODL)


An Object Definition Language is designed to support the semantic constructs of the ODMG data model. It is independent of any programming language and is used to create object specifications, such as classes and interfaces, and to specify a database schema.

b. Object Query Language (OQL)


An Object Query Language is:
•embedded into one of the host programming languages
•able to return objects that match the type system of that language
•similar to SQL, with additional features (object identity, complex objects, operations, inheritance, polymorphism, relationships)

c. OQL Entry Points and Iterator Variables

An entry point is a named persistent object (for many queries, it is the name of the extent of a class). An iterator variable is used when a collection is referenced in an OQL query.

d. OQL -Query Results and Path Expressions


Any named persistent object is itself a query; the result is a reference to that persistent object. A path expression is used to specify a path to related attributes and objects once an entry point has been specified.
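
As an illustration of an entry point and a path expression, here is a minimal sketch using the ODMG Java binding (the org.odmg package). The extent name AllDepartments and the attribute names (mgr, manager, lname, dname) follow the earlier COMPANY example; how the Implementation object is obtained, and the surrounding database and transaction handling, are vendor-specific and omitted, so treat the fragment as indicative rather than as working code for a particular product.

import org.odmg.Implementation;
import org.odmg.OQLQuery;
import java.util.Collection;

public class OqlPathExample {
    // Assumes an open database and an active transaction (vendor-specific setup).
    public static void printResearchManager(Implementation impl) throws Exception {
        OQLQuery query = impl.newOQLQuery();
        // Entry point: the named extent AllDepartments.
        // Path expression: d.mgr.manager.lname navigates from a department
        // to its manager object and then to the manager's last name.
        query.create("select d.mgr.manager.lname from d in AllDepartments " +
                     "where d.dname = \"Research\"");
        Collection result = (Collection) query.execute();
        for (Object lname : result) {
            System.out.println(lname);
        }
    }
}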


e. OQL Collection Operators

OQL Collection Operators include Aggregate operators such as: min, max, count, sum,
and avg.

Object Database Conceptual Design

The Object Database Conceptual Design includes:


ODB: relationships are handled by OID references to the related objects.
RDB: relationships among tuples are specified by attributes with matching values (value
references).
ORDBMS: enhancing the capabilities of RDBMS with some of the features in ODBMS.
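
The difference between OID references (ODB) and value references (RDB) can be sketched in Java. The classes below are an in-memory illustration only, not tied to any DBMS:

import java.util.*;

// ODB style: the relationship is a direct reference to the related object;
// navigation simply follows the reference (e.dept.dname).
class Department { int dnumber; String dname; }
class EmployeeObj {
    String name;
    Department dept;          // OID reference
}

// RDB style: the relationship is an attribute whose value matches a key in
// another relation; navigation requires a value-based lookup (a join).
class EmployeeRow {
    String name;
    int deptNumber;           // foreign key value, e.g. 5
}
class DepartmentTable {
    private final Map<Integer, String> rows = new HashMap<>();
    void insert(int dnumber, String dname) { rows.put(dnumber, dname); }
    String nameOf(int dnumber) { return rows.get(dnumber); }   // matching-value lookup
}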

Other Concepts of Object Database

 Polymorphism (Operator Overloading):


 This concept allows the same operator name or symbol to be bound to
two or more different implementations of the operator, depending on the
type of objects to which the operator is applied
 Multiple Inheritance and Selective Inheritance
Multiple inheritance in a type hierarchy occurs when a certain subtype T is a subtype of two (or more) types and hence inherits the functions (attributes and methods) of both supertypes.
For example, we may create a subtype ENGINEERING_MANAGER that is a subtype of both MANAGER and ENGINEER. This leads to the creation of a type lattice rather than a type hierarchy (a Java sketch using interfaces appears after this list).
 Versions and Configurations
 Many database applications that use OO systems require the existence of several
versions of the same object
 There may be more than two versions of an object.
 Configuration: A configuration of the complex object is a collection consisting of
one version of each module arranged in such a way that the module versions in
the configuration are compatible and together form a valid version of the complex
object.
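
Java classes cannot inherit from two classes, but the type-lattice idea behind ENGINEERING_MANAGER can be sketched with interfaces. The method sets below are illustrative only:

interface Person {
    String name();
}

interface Manager extends Person {
    int departmentManaged();
}

interface Engineer extends Person {
    String engineeringDiscipline();
}

// ENGINEERING_MANAGER inherits the functions of both supertypes,
// producing a lattice rather than a simple hierarchy.
interface EngineeringManager extends Manager, Engineer { }

class EngineeringManagerImpl implements EngineeringManager {
    public String name() { return "A. Person"; }
    public int departmentManaged() { return 5; }
    public String engineeringDiscipline() { return "Electrical"; }
}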

ODMG (Object Data Management Group)

Release 2.0 of the ODMG Standard differs from Release 1.2 in a number of ways. With the wide acceptance of Java, a Java persistence standard has been added in addition to the existing Smalltalk and C++ ones. The ODMG object model is much more comprehensive: it adds a meta-object interface, defines an object interchange format, and makes the programming language bindings consistent with the common model. Changes have been made throughout the specification based on several years of experience implementing the standard in object database products.


As with Release 1.2, we expect future work to be backward compatible with Release
2.0. Although we expect a few changes to come, for example to the Java binding, the
Standard should now be reasonably stable.

The major components of ODMG 2.0 are:

Object Model. We have used the OMG Object Model as the basis for our model. The
OMG core model was designed to be a common denominator for object request
brokers, object database systems, object programming languages, and other
applications. In keeping with the OMG Architecture, we have designed an ODBMS
profile for the model, adding components (relationships) to the OMG core object
model to support our needs. Release 2.0 introduces a meta model.

The Object Data Management Group (ODMG) was a consortium of object database and
object-relational mapping vendors, members of the academic community, and interested
parties. Its goal was to create a set of specifications that would allow for portable
applications that store objects in database management systems. It published several
versions of its specification. The last release was ODMG 3.0. By 2001, most of the major
object database and object-relational mapping vendors claimed conformance to the
ODMG Java Language Binding. Compliance to the other components of the specification
was mixed. In 2001, the ODMG Java Language Binding was submitted to the Java
Community Process as a basis for the Java Data Objects specification. The ODMG
member companies then decided to concentrate their efforts on the Java Data Objects
specification. As a result, the ODMG disbanded in 2001.

Many object database ideas were also absorbed into SQL:1999 and have been
implemented in varying degrees in object-relational database products.

In 2005 Cook, Rai, and Rosenberger proposed to drop all standardization efforts to
introduce additional object-oriented query APIs but rather use the OO programming
language itself, i.e., Java and .NET, to express queries. As a result, Native Queries
emerged. Similarly, Microsoft announced Language Integrated Query (LINQ) and
DLINQ, an implementation of LINQ, in September 2005, to provide close, language-
integrated database query capabilities with its programming languages C# and VB.NET 9.0.

In February 2006, the Object Management Group (OMG) announced that they had been
granted the right to develop new specifications based on the ODMG 3.0 specification and
the formation of the Object Database Technology Working Group (ODBT WG). The
ODBT WG plans to create a set of standards that incorporates advances in object
database technology (e.g., replication), data management (e.g., spatial indexing), and data
formats (e.g., XML) and to include new features into these standards that support
domains in real-time systems where object databases are being adopted


Object Definition Language (ODL)

Let's take a look at something that comes closer to bearing a relationship to our everyday
programming. Whether you generate your applications or code them, somehow you need
a way to describe your object model. The goal of this Object Definition Language (ODL)
is to capture enough information to be able to generate the majority of most SMB web
apps directly from a set of statements in the language . . .

Here is a rough cut of ODL along with comments. This is very much a work in progress.
Now that I have a meta-grammar and a concrete syntax for describing languages, I can
start to write the languages I have been playing with. I will then build up to those
languages in the framework so that the framework can consume metadata that can be
transformed automatically from ODL, allowing for the automatic generation of most of
my code. Expect to see BIG changes in this grammar as I combine “top down” and
“bottom up” programming, write some real world applications and see where everything
meets in the middle!

Most importantly, we have objects that are comprised of 1..n attributes and that may or
may not have relationships. This is the high level UML model kind of stuff. Note that
ODL is describing functional metadata, so an object would be “Article” – not
“ArticleService” or “ArticleDAO” which are implementation decisions and would be
generated from the Article metadata automatically.

Object Query Language (OQL)

Before looking at queries, we will digress into the built-in functions supported in OQL. The built-in functions in OQL fall into the following categories:

 Functions that operate on individual Java Objects


1. sizeof(o)-- returns size of Java object in bytes
2. objectid(o)-- returns unique id of Java object
3. classof(o)-- returns Class object for given Java object
4. identical(o1, o2) -- returns (boolean) whether the two given objects are
identical or not (essentially objectid(o1) == objectid(o2); do not use
simple JavaScript reference comparison for Java objects!)
5. referrers(o) -- returns array of objects referring to the given Java object
6. referees(o) -- returns array of objects referred by given Java object
7. reachables(o) -- returns array of objects directly or indirectly referred
from given Java object (transitive closure of referees of given object)
 Functions that operate on arrays
1. contains(array, expr) -- returns whether the array contains an element that
satisfies the given expression. The expression can refer to the built-in
variable 'it', which is the current object being iterated
2. count(array, [expr]) -- returns the number of elements satisfying the given
expression


3. filter(array, expr) -- returns a new array containing elements


satisfying given expression
4. map(array, expr) -- returns a new array that contains results of
applying given expression on each element of input array
5. sort(array, [expr]) -- sorts the given array. optionally accepts
comparison expression to use. if not given, sort uses numerical
comparison
6. sum(array) -- sums all elements of array

As you can see, most array-operating functions accept a boolean expression -- the expression can refer to the current object through the 'it' variable. This allows operating on arrays without loops -- the built-in functions loop through the array and 'apply' the expression to each element.

There is also a built-in object called heap, which provides various useful methods.

Now, let us see some interesting queries.

Select all objects referred by a SoftReference:


select f.referent from java.lang.ref.SoftReference f
where f.referent != null

referent is a private field of the java.lang.ref.SoftReference class (actually a field inherited from java.lang.ref.Reference; you may use javap -p to find these!). We filter out the SoftReferences that have already been cleared (i.e., whose referent is null).

Show referents that are not referred to by any other object, i.e., the referent is reachable only through that soft reference:

select f.referent from java.lang.ref.SoftReference f


where f.referent != null && referrers(f.referent).length
== 1

Note the use of the referrers built-in function to find the referrers of a given object. Because referrers returns an array, the result supports the length property.
Let us refine the above query. We want to find all objects that are referred to only by soft references, but we don't care how many soft references refer to them; i.e., we allow more than one soft reference to refer to an object.

select f.referent from java.lang.ref.SoftReference f where f.referent != null &&


filter(referrers(f.referent), "classof(it).name !=
'java.lang.ref.SoftReference'").length == 0

Note that the filter function filters the referrers array using a boolean expression. In the filter condition we check that the class name of the referrer is not java.lang.ref.SoftReference.


Now, if the filtered array contains at least one element, then we know that f.referent is referred to from some object that is not of type java.lang.ref.SoftReference!

Find all finalizable objects (i.e., objects of some class that overrides the 'java.lang.Object.finalize()' method):

select f.referent from java.lang.ref.Finalizer f


where f.referent != null

How does this work? When an instance of a class that overrides the finalize() method is created (a potentially finalizable object), the JVM registers the object by creating an instance of java.lang.ref.Finalizer. The referent field of that Finalizer object refers to the newly created "to be finalized" object. (This relies on an implementation detail!)

Find all finalizable objects and approximate size of the heap retained because of
those.

select { obj: f.referent, size:


sum(map(reachables(f.referent), "sizeof(it)")) }
from java.lang.ref.Finalizer f
where f.referent != null

Certainly this looks really complex -- but it is actually simple. A JavaScript object literal is used to select multiple values in the select expression (the obj and size properties). reachables finds the objects reachable from a given object. map creates a new array from the input array by applying the given expression to each element; the map call in this query creates an array of the sizes of each reachable object. The sum built-in adds all elements of the array, so we get the total size of the objects reachable from the given object (f.referent in this case). Why do I say approximate size? The HPROF binary heap dump format does not account for the actual bytes used in a live JVM; instead, sizes just large enough to hold the data are used. For example, JVMs align smaller data types such as 'char' -- a JVM would use 4 bytes instead of 2 bytes. Also, JVMs tend to use one or two header words with each object. None of this is accounted for in the HPROF dump; HPROF uses the minimal size needed to hold the data -- for example, 2 bytes for a char, 1 byte for a boolean, and so on.

1.2.8 Overview of C++ Language Binding

The C++ binding to ODBMSs includes a version of the ODL that uses C++ syntax, a mechanism to invoke OQL, and procedures for operations on databases and transactions.

The Object Definition Language (ODL) is the declarative portion of C++ ODL/OML.
The C++ binding of ODL is expressed as a library that provides classes and functions
to implement the concepts defined in the ODMG object model. OML is a language
used for retrieving objects from the database and modifying them. The C++ OML
syntax and semantics are those of standard C++ in the context of the standard class
library.


ODL/OML specifies only the logical characteristics of objects and the operations used
to manipulate them. It does not discuss the physical storage of objects. It does not
address the clustering or memory management issues associated with the stored
physical representation of objects or access structures. In an ideal world, these would
be transparent to the programmer. In the real world, they are not. An additional set of
constructs called "physical pragmas" is defined to give the programmer some direct
control over these issues, or at least to enable a programmer to provide "hints" to the
storage management subsystem provided as part of the ODBMS run time. Physical
pragmas exist within the ODL and OML. They are added to object type definitions
specified in ODL, expressed as OML operations, or shown as optional arguments to
operations defined within OML.

These pragmas are not in any sense stand-alone languages, but rather a set of
constructs added to ODL/OML to address implementation issues.

The programming-language-specific bindings for ODL/OML are based on one basic


principle -- that the programmer feels that there is one language, not two separate
languages with arbitrary boundaries between them.

The ODMG Smalltalk binding is based upon two principles -- it should bind to
Smalltalk in a natural way that is consistent with the principles of the language, and it
should support language interoperability consistent with ODL specification and
semantics. We believe that organizations specifying their objects in ODL will insist
that the Smalltalk binding honor those specifications. These principles have several
implications that are evident in the design of the binding:

 There is a unified type system that is shared by Smalltalk and the ODBMS.
 This type system is ODL as mapped into Smalltalk by the Smalltalk binding.
 The binding respects the Smalltalk syntax, meaning the Smalltalk language
will not have to be modified to accommodate this binding.
 ODL concepts will be represented using normal Smalltalk coding conventions.
 The binding respects the fact that Smalltalk is dynamically typed. Arbitrary

Smalltalk objects may be stored persistently, including ODL-specified objects,


which will obey the ODL typing semantics.


 The binding respects the dynamic memory-management semantics of
Smalltalk. Objects will become persistent when they are referenced by other
persistent objects in the database, and will be removed when they are no longer
reachable in this manner.

As with other language bindings, the ODMG Java binding is based on one fundamental
principle -- the programmer should perceive the binding as a single language for


expressing both database and programming operations, not two separate languages
with arbitrary boundaries between them. This principle has several corollaries:

 There is a single, unified type system shared by the Java language and the

object database; individual instances of these common types can be persistent or


transient.

 The binding respects the Java language syntax, meaning that the Java language

will not have to be modified to accommodate this binding.

 The binding respects the automatic storage management semantics of Java.


Objects will become persistent when they are referenced by other persistent
objects in the database, and will be removed when they are no longer
reachable in this manner.

The Java binding provides persistence by reachability, like the ODMG Smalltalk
binding (this has also been called "transitive persistence"). On database commit, all
objects reachable from database root objects are stored in the database.
The Java binding provides two ways to declare persistence-capable Java classes:

 Existing Java classes can be made persistence capable.


 Java class declarations (as well as a database schema) may automatically be

generated by a preprocessor for ODMG ODL.

One possible ODMG implementation that supports these capabilities would be a


postprocessor that takes as input the Java .class file (bytecodes) produced by the
Java compiler, then produces new modified bytecodes that support persistence.
Another implementation would be a preprocessor that modifies Java source before it
goes to the Java compiler. Another implementation would be a modified Java
interpreter.

We want a binding that allows all of these possible implementations. Because Java

does not have all the hooks we might desire, and the Java binding must use standard
Java syntax, it is necessary to distinguish special classes understood by the database
system. These classes are called persistence-capable classes. They can have both
persistent and transient instances. Only instances of these classes can be made
persistent. The current version of the standard does not define how a Java class
becomes a persistence-capable class.
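
As a concrete illustration of naming, reachability, and persistence-capable classes, here is a minimal sketch written against the ODMG 3.0 Java binding (the org.odmg package). The interface and method names follow the published binding, but exact signatures vary by product, how the Implementation object is obtained is vendor-specific, and whether Department needs pre- or post-processing to become persistence-capable depends on the ODBMS; treat the code as indicative rather than as a working program for any particular product.

import org.odmg.Database;
import org.odmg.Implementation;
import org.odmg.Transaction;

public class PersistenceByReachability {
    // A persistence-capable class; how it becomes persistence capable
    // (preprocessor, bytecode postprocessor, or modified JVM) is product-specific.
    static class Department {
        String dname;
        java.util.List employees = new java.util.ArrayList();
    }

    public static void store(Implementation impl) throws Exception {
        Database db = impl.newDatabase();
        db.open("company", Database.OPEN_READ_WRITE);

        Transaction tx = impl.newTransaction();
        tx.begin();

        Department research = new Department();
        research.dname = "Research";

        // Naming: binding the object to a name makes it a database root.
        db.bind(research, "ResearchDepartment");

        // Reachability: on commit, everything reachable from a root
        // (for example, objects added to research.employees) is stored too.
        tx.commit();
        db.close();
    }
}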


Overview of the CORBA Standard for Distributed Objects

The Common Object Request Broker Architecture (or CORBA) is an industry standard
developed by the Object Management Group (OMG) to aid in distributed objects
programming. It is important to note that CORBA is simply a specification. A CORBA
implementation is known as an ORB (or Object Request Broker). There are several
CORBA implementations available on the market such as VisiBroker, ORBIX, and
others. JavaIDL is another implementation that comes as a core package with the JDK1.3
or above.

CORBA was designed to be platform and language independent. Therefore, CORBA


objects can run on any platform, located anywhere on the network, and can be written in
any language that has Interface Definition Language (IDL) mappings.

Similar to RMI, CORBA objects are specified with interfaces. Interfaces in CORBA,
however, are specified in IDL. While IDL is similar to C++, it is important to note that
IDL is not a programming language. For a detailed introduction to CORBA

The Genesis of a CORBA Application

There are a number of steps involved in developing CORBA applications. These are:

Define an interface in IDL
Map the IDL interface to Java (done automatically)
Implement the interface
Develop the server
Develop a client
Run the naming service, the server, and the client.
We now explain each step by walking you through the development of a CORBA-based
file transfer application, which is similar to the RMI application we developed earlier in
this article. Here we will be using the JavaIDL, which is a core package of JDK1.3+.


a. Define the Interface

When defining a CORBA interface, think about the type of operations that the server will support. In the file transfer application, the client will invoke a method to download a file. The code sample below shows the FileInterface interface. Data is a new type introduced using the typedef keyword. A sequence in IDL is similar to an array, except that a sequence does not have a fixed size. An octet is an 8-bit quantity that is equivalent to the Java type byte.

Note that the downloadFile method takes one parameter of type string that is declared with the in mode. IDL defines three parameter-passing modes: in (for input from client to server), out (for output from server to client), and inout (used for both input and output).

Code Sample: FileInterface.idl

interface FileInterface
{
typedef sequence<octet> Data;
Data downloadFile(in string fileName);
};

Once you finish defining the IDL interface, you are ready to compile it. The JDK1.3+
comes with the idlj compiler, which is used to map IDL definitions into Java declarations
and statements.

The idlj compiler accepts options that allow you to specify whether you wish to generate client stubs, server skeletons, or both. The -f<side> option is used to specify what to generate. The side can be client, server, or all for both client stubs and server skeletons. In this example, since the application will be running on two separate machines, the -fserver option is used on the server side, and the -fclient option is used on the client side.

Now, let's compile the FileInterface.idl and generate server-side skeletons. Using the
command:

prompt> idlj -fserver FileInterface.idl


This command generates several files such as skeletons, holder and helper classes, and
others. An important file that gets generated is the _FileInterfaceImplBase, which will be
subclassed by the class that implements the interface.

b. Implement the interface


Now, we provide an implementation to the downloadFile method. This implementation is
known as a servant, and as you can see from Code Sample 1, the class FileServant
extends the _FileInterfaceImplBase class to specify that this servant is a CORBA object.


Code Sample 1: FileServant.java


import java.io.*;

public class FileServant extends _FileInterfaceImplBase {


public byte[] downloadFile(String fileName){
File file = new File(fileName);
byte buffer[] = new byte[(int)file.length()];
try {
// a single read() call is not guaranteed to fill the whole buffer,
// so readFully is used to read the complete file
DataInputStream input = new DataInputStream(
new BufferedInputStream(new FileInputStream(file)));
input.readFully(buffer);
input.close();
} catch(Exception e) {
System.out.println("FileServant Error: "+e.getMessage());
e.printStackTrace();
}
return(buffer);
}
}

c. Develop the server


The next step is developing the CORBA server. The FileServer class, shown in Code
Sample 2, implements a CORBA server that does the following:
Initializes the ORB
Creates a FileServant object
Registers the object in the CORBA Naming Service (COS Naming)
Prints a status message
Waits for incoming client requests

Code Sample 2 FileServer.java


import java.io.*;
import org.omg.CosNaming.*;
import org.omg.CosNaming.NamingContextPackage.*;
import org.omg.CORBA.*;
public class FileServer {
public static void main(String args[]) {
try{
// create and initialize the ORB
ORB orb = ORB.init(args, null);
// create the servant and register it with the ORB
FileServant fileRef = new FileServant();
orb.connect(fileRef);
// get the root naming context
org.omg.CORBA.Object objRef =


orb.resolve_initial_references("NameService");
NamingContext ncRef = NamingContextHelper.narrow(objRef);
// Bind the object reference in naming
NameComponent nc = new NameComponent("FileTransfer", " ");
NameComponent path[] = {nc};
ncRef.rebind(path, fileRef);
System.out.println("Server started....");
// Wait for invocations from clients
java.lang.Object sync = new java.lang.Object();
synchronized(sync){
sync.wait();
}
} catch(Exception e) {
System.err.println("ERROR: " + e.getMessage());
e.printStackTrace(System.out);
}
}
}

Once the FileServer has an ORB, it can register the CORBA service. It uses the COS
Naming Service specified by OMG and implemented by Java IDL to do the registration.
It starts by getting a reference to the root of the naming service. This returns a generic
CORBA object. To use it as a NamingContext object, it must be narrowed down (in other
words, casted) to its proper type, and this is done using the statement:

NamingContext ncRef = NamingContextHelper.narrow(objRef);


The ncRef object is now an org.omg.CosNaming.NamingContext. You can use it to
register a CORBA service with the naming service using the rebind method.

d. Develop a client
The next step is to develop a client. An implementation is shown in Code Sample 3. Once
a reference to the naming service has been obtained, it can be used to access the naming
service and find other services (for example the FileTransfer service). When the
FileTransfer service is found, the downloadFile method is invoked.

Code Sample 3: FileClient


import java.io.*;
import java.util.*;
import org.omg.CosNaming.*;
import org.omg.CORBA.*;

public class FileClient {


public static void main(String argv[]) {


try {
// create and initialize the ORB
ORB orb = ORB.init(argv, null);
// get the root naming context
org.omg.CORBA.Object objRef =
orb.resolve_initial_references("NameService");
NamingContext ncRef = NamingContextHelper.narrow(objRef);
NameComponent nc = new NameComponent("FileTransfer", " ");
// Resolve the object reference in naming
NameComponent path[] = {nc};
FileInterfaceOperations fileRef =
FileInterfaceHelper.narrow(ncRef.resolve(path));

if(argv.length < 1) {
System.out.println("Usage: java FileClient filename");
return;
}

// save the file


File file = new File(argv[0]);
byte data[] = fileRef.downloadFile(argv[0]);
BufferedOutputStream output = new
BufferedOutputStream(new FileOutputStream(argv[0]));
output.write(data, 0, data.length);
output.flush();
output.close();
} catch(Exception e) {
System.out.println("FileClient Error: " + e.getMessage());
e.printStackTrace();
}
}
}

e. Running the application


The final step is to run the application. There are several sub-steps involved:

Running the CORBA naming service. This can be done using the command tnameserv.
By default, it runs on port 900. If you cannot run the naming service on this port, then
you can start it on another port. To start it on port 2500, for example, use the following
command:

prompt> tnameserv -ORBinitialPort 2500


Start the server. This can be done as follows, assuming that the naming service is running on the default port number:

prompt> java FileServer


If the naming service is running on a different port number, say 2500, then you need to
specify the port using the ORBInitialPort option as follows:
prompt> java FileServer -ORBInitialPort 2500


Generate Stubs for the client. Before we can run the client, we need to generate stubs for the client. To do that, get a
copy of the FileInterface.idl file and compile it using the idlj compiler specifying that you wish to generate client-
side stubs, as follows:

prompt> idlj -fclient FileInterface.idl


Run the client. Now you can run the client using the following command, assuming that
the naming service is running on port 2500.

prompt> java FileClient hello.txt -ORBInitialPort 2500

Where hello.txt is the file we wish to download from the server.


Note: if the naming service is running on a different host, then use the -ORBInitialHost
option to specify where it is running. For example, if the naming service is running on
port number 4500 on a host with the name gosling, then you start the client as follows:
prompt> java FileClient hello.txt -ORBInitialHost gosling -ORBInitialPort 4500
Alternatively, these options can be specified at the code level using properties. So instead
of initializing the ORB as:
ORB orb = ORB.init(argv, null);
It can be initialized by specifying the host on which the naming service runs (called gosling) and the naming service's port number (2500), as follows:

Properties props = new Properties();


props.put("org.omg.CORBA.ORBInitialHost", "gosling");
props.put("org.omg.CORBA.ORBInitialPort", "2500");
ORB orb = ORB.init(argv, props);
Exercise
In the file transfer application, the client (in both cases RMI and CORBA) needs to know
the name of the file to be downloaded in advance. No methods are provided to list the
files available on the server. As an exercise, you may want to enhance the application by
adding another method that lists the files available on the server. Also, instead of using a
command-line client you may want to develop a GUI-based client. When the client starts
up, it invokes a method on the server to get a list of files then pops up a menu displaying
the files available where the user would be able to select one or more files to be
downloaded.

Developing distributed object-based applications can be done in Java using RMI or


JavaIDL (an implementation of CORBA). The use of both technologies is similar since
the first step is to define an interface for the object. Unlike RMI, however, where
interfaces are defined in Java, CORBA interfaces are defined in the Interface Definition
Language (IDL). This, however, adds another layer of complexity where the developer
needs to be familiar with IDL, and equally important, its mapping to Java.

Making a selection between these two distribution mechanisms really depends on the
project at hand and its requirements. I hope this article has provided you with enough
information to get started developing distributed object-based applications and enough
guidance to help you select a distribution mechanism.


 CORBA/IIOP support. Extends application services to Web clients, for


integration with your existing applications architecture.
 Flexible, pervasive security. Personalize access to data and applications based on
individual and group roles. Extend security to HTML files and other data, for
pervasive security no matter how or where Web content is stored.
 Enhanced HTTP stack. The HTTP engine delivers outstanding performance and
Java servlet support.
 Integration with Microsoft IIS. Use IIS as the HTTP engine for ValidSolutions,
to dramatically enhance IIS security and bring 21 CFR part 11 compliant Web
application services to your NT-based Web environment.

With support for CORBA and IIOP, the ValidSolution allows you to create client/server
Web applications that take advantage of the web objects and application services. In
addition, you can now access back-end relational databases for enhanced data integration
using the Enterprise Connection Services.

Valid Components can leverage the Enterprise Connection Services (ECS) for building
live links between pages and forms, to data from relational databases. To set up the links,
you simply use the ECS template application to identify your forms and fields that will
contain external source data, and to define the real-time connection settings. You can set
up connections for DB2, Oracle, Sybase, EDA/SQL, and ODBC.

The Domino Application Server also allows you to design applications with CORBA-
standard distributed objects

1.2.9 Object relational and Extended Relational Database Systems Evolution &
Current trends of Database Technology

Security concerns must be addressed when developing a distributed database. When


choosing between the object-oriented model and the relational model, many factors should
be considered. The most important of these factors are single level and multilevel access
controls, protection against inference, and maintenance of integrity. When determining
which distributed database model will be more secure for a particular application, the
decision should not be made purely on the basis of available security features. One
should also question the efficacy and efficiency of the delivery of these features. Do the

features provided by the database model provide adequate security for the intended
application? Does the implementation of the security controls add an unacceptable
amount of computational overhead? In this paper, the security strengths and weaknesses
of both database models and the special problems found in the distributed environment
are discussed.

As distributed networks become more popular, the need for improvement in distributed
database management systems becomes even more important. A distributed system varies
from a centralized system in one key respect:


The data and often the control of the data are spread out over two or more geographically
separate sites. Distributed database management systems are subject to many security
threats additional to those present in a centralized database management system (DBMS).
Furthermore, the development of adequate distributed database security has been
complicated by the relatively recent introduction of the object-oriented database model.
This new model cannot be ignored. It has been created to address the growing complexity
of the data stored in present database systems.

For the past several years the most prevalent database model has been relational. While
the relational model has been particularly useful, its utility is reduced if the data does not
fit into a relational table. Many organizations have data requirements that are more
complex than can be handled with these data types. Multimedia data, graphics, and
photographs are examples of these complex data types.

Relational databases typically treat complex data types as BLOBs (binary large objects).
For many users, this is inadequate since BLOBs cannot be queried. In addition, database
developers have had to contend with the impedance mismatch between the third
generation language (3GL) and structured query language (SQL). The impedance
mismatch occurs when the 3GL command set conflicts with SQL. There are two types of
impedance mismatches: (1) Data type inconsistency: A data type recognized by the
relational database is not recognized by the 3GL. For example, most 3GLs don’t have a
data type for dates. In order to process date fields, the 3GL must convert the date into a
string or a Julian date. This conversion adds extra processing overhead. (2) Data
manipulation inconsistency: Most procedural languages read only one record at a time,
while SQL reads records a set at a time. This problem is typically overcome by
embedding SQL commands in the 3GL code. Solutions to both impedance problems add
complexity and overhead. Object-oriented databases have been developed in response to
the problems listed above: They can fully integrate complex data types, and their use
eliminates the impedance mismatch [Mull94].
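
To make the manipulation mismatch concrete, the sketch below contrasts set-at-a-time SQL
with the record-at-a-time processing a 3GL host program typically performs. It is written
in Transact-SQL and assumes a hypothetical orders table; it is an illustration, not code
from any particular product.

-- Set-at-a-time: SQL returns all qualifying rows in one operation
SELECT order_id, order_date FROM orders WHERE customer_id = 1;

-- Record-at-a-time: a cursor fetches one row per call, mirroring the way a
-- procedural 3GL consumes embedded SQL results
DECLARE order_cursor CURSOR FOR
    SELECT order_id, order_date FROM orders WHERE customer_id = 1;
OPEN order_cursor;
FETCH NEXT FROM order_cursor;
WHILE @@FETCH_STATUS = 0
    FETCH NEXT FROM order_cursor;   -- process the current row, then fetch the next
CLOSE order_cursor;
DEALLOCATE order_cursor;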

The development of relational database security procedures and standards is a more
mature field than for the object-oriented model. This is principally due to the fact that
object-oriented databases are relatively new. The relative immaturity of the object-
oriented model is particularly evident in distributed applications. An inconsistent
standard is an example: Developers have not embraced a single set of standards for
distributed object-oriented databases, while standards for relational databases are well
established [Sud95]. One implication of this disparity is the inadequacy of controls in
multilevel heterogeneous distributed object-oriented systems.

In this paper, we will review the security concerns of databases in general and distributed
databases in particular. We will examine the security problems found in both models, and
we will examine the security problems unique to each system. Finally, we will compare
the relative merits of each model with respect to security.


1.2.10 The Informix Universal Server

While Oracle and Sybase come to mind first when thinking of relational database
technology for the Unix platform, Informix Corp. claims the largest installed base of
relational database engines running on Unix. (See "Informix on the Move," DBMS,
November 1995, page 46.) Furthermore, Informix appears to be focused more
specifically on a mission statement to deliver "... the best technology and services for
developing enterprisewide data management applications for open systems." Something
must be working right. Informix's 1995 revenue ($709 million) and net income ($105.3
million) are up by more than 50 percent and 59 percent, respectively, compared to 1994.
This puts Informix on track to join the ranks of other billion dollar software businesses
within the next year or two.

Founded in 1980 by Roger Sippl, Informix went public in 1986 and released its current
top-of-the-line product, the OnLine Dynamic Server RDBMS, in 1988. While the current
Informix product line reflects a focus on database servers and tools, Informix has always
encouraged a healthy applications market founded on the use of its tools and server
engines. Whereas Oracle developed its own line of accounting and distribution
applications, Informix left this to third parties. Both FourGen Software (Seattle, Wash.)
and Concepts Dynamic (Schaumburg, Ill.), among others, have developed full accounting
application suites based on the Informix RDBMS and built with Informix development
tools.

The only time Informix diverted from its database-centric strategy was in 1988, when it
merged with Innovative Software, adding the SmartWare desktop applications suite to its
database-centric product line. This product acquisition, together with that of the Wingz
graphical spreadsheet, followed a pattern similar to Novell's later acquisition of
WordPerfect's desktop business. Both companies, Informix and Novell, moved into
businesses that they did not understand and eventually divested the products they
acquired. Also, just as the WordPerfect acquisition triggered the departure of Novell
founder Ray Noorda, the SmartWare acquisition triggered the departure of Roger Sippl
from Informix.

Both Informix and Novell subsequently refocused on their core businesses as a result of
these forays into desktop applications. The current chairman, president, and CEO of
Informix, Phillip E. White, joined the company in 1989. He took over in 1992 from
Roger Sippl, who left to found Visigenic, a database access company focused on ODBC
technology. White is credited with increasing shareholder value from 56 cents per share
at the end of 1990 to $30 per share at the end of 1995. This performance placed Informix
at the top of the Wall Street Journal's Shareholder Scoreboard for best five-year
performer.

Without the opportunity to grow revenues through diversifying into applications or other
non-database areas, Informix could face difficulties in sustaining its growth.
Consequently, Informix is pursuing a number of strategies to strengthen and differentiate
its core database products in order to reach new markets. These strategies include:


* increasing the range of data types that Informix RDBMS engines can handle

* establishing Informix engines as data warehousing platforms

* making Informix servers attractive for use in mobile computing

* taking advantage of the Internet to reach new database markets

* exploiting other emerging technologies, such as SmartCards

Dynamic Scalable Architecture (DSA)

DSA is the marketing term for a database architecture designed to position Informix as a
leading provider in the area of parallel processing and scalable database server
technology. DSA provides a foundation for a range of high-end Informix database servers
based on variants of the same core engine technology:

* The OnLine Extended Parallel Server is designed for very high-volume OLTP
environments that need to utilize loosely coupled or shared-nothing computing
architectures composed of clusters of symmetrical multiprocessing (SMP) or massively
parallel processing (MPP) systems.

* The Online Dynamic Server is designed for high-volume OLTP environments that
require replication, mainframe-level database administration tools, and the performance
delivered by Informix's parallel data query technology (PDQ). PDQ enables parallel table
scans, sorts, and joins, parallel query aggregation for decision support and parallel data
loads, index builds, backups, and restores. Although this server supports SMP it does not
support MPP, which is the essential differentiating feature between the OnLine Dynamic
Server and the OnLine Extended Parallel Server.

* The OnLine Workgroup Server is designed for smaller numbers of user connections (up
to 32 concurrent) and lower transaction volumes. It is also easier to administer because it
offers less complex functionality compared to the higher-end servers.

These three server products position Informix to compete effectively against similar

stratified server families from Oracle, IBM, and Sybase, as well as niche players such as
Microsoft with its SQL Server product and Computer Associates with CA-OpenIngres.
However, while IBM may lead with the exceptional database administration breadth and
depth of its DB2 engine or Microsoft with the ease of use of its graphical administration
tools, Informix is setting the pace in support for parallel processing that addresses an
issue dear to every database user's heart, namely performance.

Informix-Universal Server

Informix has supported binary large object (BLOB) data for many years but the company
recognizes that the need to store, and more important, to manipulate complex data other


than text and numeric data, will be critical to its ability to address future customer needs.
For this reason, Informix recently completed its acquisition of Illustra Information
Technologies, founded by Ingres RDBMS designer Dr. Michael Stonebraker. Illustra
specializes in handling image, 2D and 3D spatial data, time series, video, audio, and
document data using snap-in modules called DataBlades that add object handling
capabilities to an RDBMS via extensions to SQL. Informix has announced its intention to
fully integrate Illustra technology into a new Informix-Universal Server product within
the next year.

If Informix manages this task, and analysts such as Richard Finkelstein of Performance
Computing doubt that it will (see Computerworld, February 12, 1996), Informix-
Universal Server could put Informix in a unique position to service specialized and
highly profitable markets such as:

* multimedia asset management for the entertainment industry

* electronic publishing and content management across the Internet

* risk management systems for financial services companies

* government and commercial geographic information systems (GISs)

Establishing an early leadership position in any one of these markets could easily account
for another billion dollars in revenue for Informix. This would surely justify the time and
cost required to rearchitect its core engine around the Illustra technology and position
Informix as a player in the object/relational database market.

Delivery of the Informix-Universal Server is slated to take place in three phases:

1. delivery of a gateway to allow customers to access complex data stored in an Illustra
Server and integrate it with traditional relational data in an Informix server (the second
quarter of 1996)

2. delivery of a DataBlades Developer Tool Kit for creating new user-defined data types
that work in both the Illustra Server and the new Informix-Universal Server (the second

quarter of 1996)
3. delivery of the fully merged Informix-Universal Server technology including "snap in"
DataBlades (the fourth quarter of 1996)

a. Riding Waves

To some extent, you could argue that Informix (like competitors Oracle and Sybase) has
surfed the technology wave of relational databases and Unix-based open systems that has
swept across corporations over the last decade. Another more recent wave, data
warehousing, is far from peaking, and Informix hedged its bets in this area with its


acquisition of the San Francisco-based Stanford Technology Group (STG). STG is
known for its MetaCube product, which presents a multidimensional view of underlying
relational data through the use of an intermediary metadata layer. This lets users of
Informix RDBMS servers carry out online analytical processing (OLAP) by using the
MetaCube technology. Informix already has a major data warehouse implementation
underway at the Consumer Market Division of communications giant MCI. This data
warehouse is expected to grow from a 600GB data mart up to three terabytes.

Oracle and Sybase have also taken initiatives in this area and are integrating OLAP
technology into their product lines to ensure that they lose as few possible sales to
multidimensional server vendors such as Arbor Software (Sunnyvale, Calif.), which sells
the Essbase Analysis Server, or to specialized data warehouse server vendors such as Red
Brick Systems (Los Gatos, Calif.). The data warehousing wave provides database
vendors the chance to offer an application that is no more than their current database
engine and some combination of front-end query and reporting tools. The data warehouse
solution from Informix also benefits from its built-in parallel processing functionality and
log-based "continuous" data replication services for populating the data warehouse from
other Informix servers. Leading U.K. database analysts Bloor Research Group cited
Informix's DSA as "the best all-round parallel DBMS on the market" and claimed it "has
significant benefits over almost all its competitors on data warehouse applications"
("Parallel Database Technology: An Evaluation and Comparison of Scalable Systems,"
Bloor Research Group, October 1995).

b. Going Mobile

International Data Corp. forecasts suggest that shipments of laptop computers will grow
from four million in 1995 to some eight million in 1999 in the U.S. alone. In other words,
the road warrior population is set to at least double, and as more workers telecommute
and the influence of the Internet makes itself felt in the business world, the term "office"
will simply come to mean "where you are at this point in time." To support this scenario,
Informix is working on its "anytime, anywhere" strategy, which sounds suspiciously
similar to the concepts espoused by Sybase for its SQL Anywhere server product based
on the recently acquired Watcom SQL engine.

However, the key to Informix's strategy for the mobile computing market is
asynchronous messaging based on new middleware products being built by Informix that
provide store-and-forward message delivery and the use of software agents to manage the
process. Asynchronous messaging lets mobile clients send and receive messages without
maintaining a constant connection with the server. Store-and-forward message delivery
ensures that messages get sent or completed as soon as a connection is established or
reestablished. The middleware and software agents are used to establish and maintain
connections, to automate repetitive tasks, and to intelligently sort and save information.
The applications that deliver this functionality can be created using the Informix class
libraries built in the Informix NewEra tool, which allows for application partitioning to
deploy components on mobile clients or servers.


c. New Era of RAD

NewEra is Informix's rapid application development tool that competes with Powersoft's
(a Sybase company) PowerBuilder and Oracle's Developer 2000. Compared to its
competitors, NewEra benefits from a strong object-oriented design that delivers a
repository-based, class library-driven application development paradigm using class
browsers for navigating application objects. NewEra can also generate cross-platform
applications. Specifically, NewEra includes:

* a graphical window and form painter with a code generator

* a graphical front end for managing NewEra application components

* a graphical language editor for managing NewEra code

* an interactive, graphical debugger for analyzing NewEra programs

* repositories, class browsers, and configuration tools supporting team-based
development

* reusable class libraries that can be Informix or third party provided or developer
defined

The impending release of the latest version of NewEra, expected in the second quarter of
1996, is slated to deliver user-defined application partitioning for three-tier client/server
deployment; OLE remote automation server support to allow OLE clients to make
requests against NewEra built application servers; and class libraries to support
transaction-processing monitors for load balancing of high volume OLTP applications. If
this functionality is delivered as promised, then client/server application vendors such as
Concepts Dynamic (Schaumburg, Ill.), whose Control suite of accounting applications is
written in NewEra, will benefit from their use of Informix technology.

d. The Web Word

Informix, like everyone these days, is hot on the Web word. World Wide Web Interface

Kits are available for use by Informix customers building Web applications using
Informix-4GL or Informix-ESQL/C tools that need to use the common gateway interface
(CGI) as a means to access Informix databases across the Internet. Informix has
established a Web partner program to build links with other Web software developers
such as Bluestone Inc.(Mountain View, Calif.) and Spider Technologies (Palo Alto,
Calif.). Informix customers such as MCI, Choice Hotels, and the Internet Shopping
Network are already forging ahead with Informix-based Web solutions. Illustra (now
owned by Informix) also recently collaborated with other partners to deliver "24 Hours in
Cyberspace." This event, claimed to be the largest online publishing event ever staged,
allowed the organizers to create a new web page every 30 minutes comprising


multimedia content delivered from hundreds of sites worldwide and stored in an Illustra
DBMS.

Informix also partnered with Internet darling Netscape Communications Corp. to include
the Informix-OnLine Workgroup Server RDBMS as the development and deployment
database for Netscape's LiveWire Pro. The LiveWire Pro product is part of Netscape's
SuiteSpot Web application development system for building online applications. This
deal involves cross-licensing and selling of Informix and Netscape products and is likely
to be among the first of many such collaborations between database and Internet vendors
during 1996.

e. SmartCards and Internet Personal Communicators (IPC)

While the IPC vs. PC debate rages on in the press, let me put a spin on this scenario for
you. You are a road warrior and before leaving on a trip you slip your personal profile
SmartCard (PPS) into your jacket pocket and leave the laptop at home. Your PPS
contains your personal login information and access numbers for Internet and Intranet
connectivity. Eventually this PPS may also be software agent-trained to search for news
on specific subjects, and may contain a couple of Java applets for corporate Intranet
application front ends to submit your T&E (travel and entertainment) and review your
departmental schedule. When you check into your room, there is an IPC designed
specifically for OLIP (online Internet processing).

This IPC, which costs your hotel the same amount as the TV in your room, is a combined
monitor, PPS reader, and keyboard/mouse already plumbed into the Internet. You switch
on the IPC and with one swipe of your PPS in the reader you upload all your profile data
into the IPC's local memory. While this is taking place, the hotel uses the opportunity to
display its home page, welcoming you to the hotel, advertising goods and services, and, if
you are a regular guest, showing you your current bill and your frequent guest program
status. You then fire up your favorite browser to process some email, set your software
agent off to collect the news, submit your trip expenses to the home office Intranet, and
review your current schedule to book a few calls and juggle some appointments. All of
this was done without a laptop or personal computer in sight and depends only on a
simple device connected to the Internet and a SmartCard.

SmartCards are another technology on which Informix is working together with its
partners, Hewlett-Packard (Palo Alto, Calif.) and GemPlus Card International Corp.
(Gaithersburg, Md.). SmartCards will be used for all sorts of applications including
buying, identifying, and securing things. It is not hard to see SmartCards being carried by
everyone and combining your credit card, phone card, driver's license, and medical alert
data onto one slim "plastic" database.

f. Putting the Right Foot Forward

It's hard to see Informix taking a wrong step at the moment. The positioning of the
Informix-Universal Server, the complementary strategies of mobile computing, Web-


enabling, and SmartCards show some good, focused vision. Phillip White's record, as
well as that of on-staff database gurus such as Dr. Michael Stonebraker of Ingres/Illustra
fame and Mike Saranga of DB2 fame, all shows the proven ability to execute these
strategies successfully. Sounds like a recipe for success to me.

1.2.11 Object-Relational Features of Oracle 8i

Oracle 8i server software has many optional components to choose from:
The Oracle 8i server software
Net8 Listener
The Oracle 8i utilities
SQL*Plus
A starter database
The object and spatial options help with data mapping and handling.
An instance can be started and a database opened in restricted mode so that the database is
available only to administration personnel. This mode helps to accomplish the following
tasks (a command sketch follows the list):
 Perform structure maintenance, such as rebuilding indexes.
 Perform an export or import of database data.
 Perform a data load with SQL*Loader.
 Temporarily prevent typical users from using data.
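
As a minimal sketch (assuming a suitably privileged administrative connection in SQL*Plus),
a restricted session can be entered and left with the following commands:

-- Start the instance and open the database in restricted mode
STARTUP RESTRICT;

-- ... perform maintenance, exports/imports, SQL*Loader runs ...

-- Let typical users back in without restarting the instance
ALTER SYSTEM DISABLE RESTRICTED SESSION;

-- An already open instance can also be placed into restricted mode
ALTER SYSTEM ENABLE RESTRICTED SESSION;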

1.2.12 An Overview of SQL

 The SQL Standard and Its Components


Structured Query Language is a high level language that was developed to provide
access to the data contained in relational databases. SQL has been widely adopted and
now almost all contemporary databases can be accessed using SQL. The American
National Standards Institute (ANSI) has standardized the SQL language. SQL Server
uses a dialect of SQL called Transact-SQL. Transact-SQL contains several flow
control keywords that facilitate its use for developing stored procedures.
SQL is used for database management tasks such as creating and dropping tables
and columns, and for writing triggers and stored procedures. It is also used to change
SQL Server's configuration, and it can be used interactively with SQL Server's graphical
Query Analyzer utility to perform ad hoc queries.
 Object-Relational Support in SQL-99
SQL Server database objects consist of tables, columns, indexes, views, constraints,
rules, defaults, triggers, stored procedures, and extended stored procedures.
As you can see in the table, each SQL Server table contains a set of related information,
where each table represents a different object in the publishing business.

 Some New Operations and Features in SQL


Transact-SQL provides three categories of SQL support: DDL (Data Definition
Language), DML (Data Manipulation Language), and DCL (Data Control Language).
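
A minimal illustration of the three categories, using a hypothetical titles table, might look
like this in Transact-SQL:

-- DDL: define a schema object
CREATE TABLE titles (
    title_id   INT PRIMARY KEY,
    title_name VARCHAR(80) NOT NULL
);

-- DML: manipulate the data held in it
INSERT INTO titles (title_id, title_name) VALUES (1, 'Advanced RDBMS');
SELECT title_id, title_name FROM titles WHERE title_id = 1;

-- DCL: control who may access it
GRANT SELECT ON titles TO public;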


SQL Server Enterprise Manager is a graphical client/server administration and
management tool that allows you to perform database management, backup and restore
operations, and to set up security and database replication.

1.2.13 Implementation & related issues for extended type systems

 Managing Large Objects and Other Storage Features

Like C++, Oracle 8 provides built-in constructors for values of a declared type, and these
constructors bear the name of the type. Thus, the type name point_type followed by a
parenthesized list of appropriate values forms a value of type point_type.
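
A minimal sketch of this in Oracle's object-relational SQL (the type and table names here
are illustrative, not taken from the text):

-- Declaring the type also gives us a constructor of the same name
CREATE TYPE point_type AS OBJECT (x NUMBER, y NUMBER);
/

CREATE TABLE landmarks (
    name     VARCHAR2(40),
    location point_type
);

-- The constructor point_type(...) builds a value of the declared type
INSERT INTO landmarks VALUES ('Town hall', point_type(3, 7));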

One of the most important parts of an Oracle database is its data dictionary. The data
dictionary is a read-only set of tables that provides information about its associated
database. Dynamic performance tables are not true tables, and most users should not
access them. However, database administrators can query and create views on these tables
and grant access to those views to other users. These views are sometimes called fixed
views because they cannot be altered or removed by the database administrator.
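
For example, a database administrator might query the dictionary and the dynamic
performance (fixed) views as sketched below; the view name session_summary and the
grantee scott are illustrative.

-- Static data dictionary view: tables owned by the current user
SELECT table_name FROM user_tables;

-- Dynamic performance ("fixed") view: normally restricted to administrators
SELECT username, status FROM v$session;

-- Exposing selected fixed-view data to another user through an ordinary view
CREATE VIEW session_summary AS
    SELECT username, status FROM v$session;
GRANT SELECT ON session_summary TO scott;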

 The Nested Relational Data Model

The nested relational data model is a natural generalisation of the relational data model,
but it often leads to designs which hide the data structures needed to specify queries and
updates in the information system. The relational data model on the other hand exposes
the specifications of the data structures and permits the minimal specification of queries
and updates using SQL. However, there are deficiencies in relational systems which lead
to a demand for object-oriented nested relational solutions. This paper argues that these
deficiencies are not inherent in the relational data model, but are deficiencies in the
implementations of relational database systems.

The paper first sketches how the nested-relational model is a natural extension of the
object-relational data model, then shows how the nested relational model, while sound, is
expensive to use. It then examines the object-oriented paradigm for software engineering,
and shows that it gives very little benefit in database applications. Rather, the relational
model as represented in conceptual modeling languages is argued to provide an ideal
view of the data. The ultimate thesis is that a better strategy is to employ a main-memory
relational database optimised for queries on complex objects, with a query interface
based on a conceptual model query language. Object-relational data model leads to
nested relations


The object-relational data model (Stonebraker, Brown and Moore 1999) arises out of the
realisation that the relational data model abstracts away from the value sets of attribute
functions. If we think in terms of tuple identifiers in relations (keys), then a relation is
simply a collection of attribute functions mapping the key into value sets.

The pure relational data model is based on set theory, and operates in terms of
projections, cartesian products and selection predicates. Cartesian product simply creates
new sets from existing sets, while projection requires the notion of identity, since the
projection operation can produce duplicates, which must be identified. Selection requires
the concept of a predicate, but the relational model abstracts away from the content of the
predicate, requiring only a function from a tuple of value sets into {true, false}. The
relational system requires only the ability to combine predicates using the propositional
calculus.

Particular value sets have properties which are used in predicates and in other operations.
The only operator used in the pure relational model is identity. The presence of this
operator is guaranteed by the requirement that the value sets be sets, although in practice
some value sets do not for practical purposes support identity (eg real number represented
as floating point).

This realisation that the relational data model abstracts away from the types of value sets
and from the operators which are available to types has allowed the design of database
systems where the value sets can be of any type. Besides integers, strings, reals, and
booleans, object-relational databases can support text, images, video, animation,
programs and many other types. Each type supports a set of operations and predicates
which can be integrated with the relational operations into practical solutions (each type
is an abstract data type).

If a value set can be of any type, why not a set of elements of some type? Why not a
tuple? If we allow sets and tuples, then why not sets of tuples? Sets of tuples are relations
and the corresponding abstract data type is the relational algebra. Thus the object-relational
data model leads to the possibility of relation-valued attributes in relations.
Having relation-valued attributes in relations looks as if it might violate first normal
form. However, the outer relational operations can only result in tuples whose attribute
values are either copies of attribute values from the original relations or are functions of
those values, in the same way as if the value sets were integers, the results are either the
integers present in the original tables or functions like square root of those integers. In
other words, the outer relational system can only see inside a relation-valued attribute to
the extent that a function is supplied to do so. These functions are particular to the
schema of the relation-valued attribute, and have no knowledge of the outer schema.
Since the outer relational model and the abstract data type of a relation-valued attribute
are the same abstract data type, it makes sense to introduce a relationship between the two.

The standard relationships are unnest and nest. Unnest is an operator which modifies the
scheme of the outer data model, replacing the relation-valued attribute function by a
collection of attribute functions corresponding to the scheme of the inner relation. Nest is


the reverse operation, which modifies the outer scheme by packaging a collection of
attributes into a single relation-valued attribute.

Having relation-valued attributes together with nest and unnest operations between the
outer and inner relational systems is called the nested relational data model. We see that
the nested relational data model is a natural extension of the object-relational data model.

 Use of the nested relational data model for object-oriented development

In recent years the object-oriented model has become the dominant programming model
and is becoming more common in systems design, including information systems. The
data in an object-oriented system consists typically of complex data structures built from
tuple and collector types. The tuple type is the same as the tuple type in the
object-relational model. A collector type is either a set, list or multiset. The latter two can
be seen as sets with an additional attribute: a list is a set with a sequence attribute, while a
multiset is a set with an additional identifying attribute. So a nested-relational data model
can represent data from an object-oriented design. Accordingly, object-relational
databases with object-relational nested SQL can be used to implement object-oriented
databases. How this is done is described for example by Stonebraker, Brown and Moore
(1999) (henceforth SBM). We should note that both the relational and object-oriented
data models are implementations of more abstract conceptual data models expressed in
conceptual data modelling languages such as the Entity-Relationship-Attribute (ERA)
method. Well-established information systems design methods begin the analysis of data
with a conceptual model, moving to a particular database implementation at a later stage.
An example adapted from SBM will clarify some issues.

Consider the E-R data model in Figure 1.

Figure 1: A conceptual model. Since the relationship between department and vehicle is
one-to-many, associated with each department is a set of vehicles.

A nested-relational implementation of this conceptual data model is

Dept(ID: int, other: various, car: set of (vehID: string, make: string, year: int))    (1)
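
In an object-relational DBMS this nesting might be declared roughly as follows; the sketch
uses Oracle-style nested table syntax, and the type names are illustrative.

-- Row type of the inner relation
CREATE TYPE vehicle_t AS OBJECT (
    vehID VARCHAR2(20),
    make  VARCHAR2(20),
    year  NUMBER
);
/
-- A relation-valued attribute: a set (table) of vehicle rows
CREATE TYPE vehicle_set_t AS TABLE OF vehicle_t;
/
-- The outer relation with its nested attribute
CREATE TABLE dept (
    id    NUMBER PRIMARY KEY,
    other VARCHAR2(100),
    car   vehicle_set_t
) NESTED TABLE car STORE AS dept_car_tab;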

 OR SQL on nested relations


SQL has been extended by SBM among others to handle object-relational databases,
mainly by permitting in SQL statements the predicates and operators particular to the


abstract data types supporting the value sets. In particular, nested relational systems are
supported by extending and overloading the dot notation used for disambiguating
attribute names.

For example, in
Select ID from dept where car.year = 1999 (2)

The dot in car.year identifies the year attribute of the car tuple, and also designates the
membership of a tuple with year = 1999 in the set of tuples which is the value set of dept.car. The
result of this query on the table of Figure 2 is ID = 1.

As a consequence of this overloading, the and boolean operator in the WHERE clause
becomes, if not ambiguous, at least counterintuitive to someone used to standard SQL.

The query

Select ID from dept where car.make = 'Laser' (3)

has the same result, ID = 1. Since the outermost interpretation of dot is set membership,
in the query

Select ID from dept where car.year = 1999 and car.make = 'Laser' (4)

the and operator is interpreted as set intersection, and the result is also ID = 1.
This result, although correct, is probably not what the maker of the query intended. They
would more likely have been looking for a department, which has a 1999 Laser, and the
response they would be looking for would be none.
There are two ways to fix this problem. One is to import a new and operator from the
relational ADT, so that (4) becomes

Select ID from dept where car.year = 1999 and2 car.make = 'Laser' (5)

In this solution, both arguments of and2 must refer to the same relation-valued attribute of
the outer system.

The other solution is to unnest the table so that the standard relational operator works in
the way it does in standard SQL

Select ID from dept, dept.car where (6)
car.year = 1999 and2 car.make = 'Laser'

where the addition of dept.car to the FROM clause signifies unnesting.


The former method is problematic since nesting can occur to any level, and the second is
problematic since it requires the user to introduce navigation information into the query.
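
For comparison, the unnesting expressed by query (6) can be written in Oracle's
object-relational SQL with the TABLE() operator, assuming the nested dept.car declaration
sketched earlier; there the ordinary and applies, because the rows are already flattened.

SELECT d.id
FROM   dept d, TABLE(d.car) c
WHERE  c.year = 1999
AND    c.make = 'Laser';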

The same sort of problem occurs when we try to correlate the SELECT clause with the
WHERE clause

Select ID, car.year from dept where car.make = 'Laser' (7)

returns the table


ID = 1, Year = {1991, 1999} (8)

when applied to the table of Figure 2, as a consequence of first normal form. We need
again to use unnest to convert the nested structure to a flat relational structure in order to
make the query mean what we want to say. Although OR SQL is a sound and complete
query language, the simple-looking queries tend to be not very useful, and in order to
make useful queries additional syntax and a good understanding of the possibly complex
and possibly multiple nesting structure is essential. The author’s experience is that it is
very hard to teach, even to very advanced students.

 Representation of many to many relationships


If we are going to use the nested relational model to represent complex data structures,
then we must take account of many to many relationships

Figure 2: A Many to Many Relationship

There are several different ways to implement this application in the nested relational
model, taking each of the entities as the outermost relation. If implemented as a single
table, two of the entities would be stored redundantly because of the many-to-many
relationships. So the normalised way is to store the relationships as sets of reference types
(attributes whose value sets are object identifiers).

If the query follows the nesting structure used in the implementation, then we have only
the problems of correlation of various clauses in the SQL query described in the last
section.

However, if the query does not follow the nesting structure, it can get very complex. For
example, if the table has a set of courses associated with each student and a set of
lecturers associated with each course, then in order to find the students associated with a
given lecturer, the whole structure needs to be unnested, and done so across reference

types. The query is hard to specify, and would be very complex to implement.

One might argue that one should not use the nested relational model for many to many
relationships. But nested systems can interact, as in Figure 3.


Figure 3: A many-to-many with nesting

In this case, an event has a set of races, and a team has a set of competitors, and we have
to decide whether a race has a set of references to competitor or vice versa. What if we
want to find what events a team participates in? The whole structure must be unnested.

The point is that representing these commonly occurring complex data structures using a
nested relational model is very much more complex than representing them in the
standard relational model.
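
For contrast, a standard relational representation of the Figure 3 situation makes no
packaging choice at all: the race-competitor relationship becomes an ordinary associative
table, and the team-to-event question is a plain join. The table and column names below
are illustrative.

CREATE TABLE event      (event_id      INT PRIMARY KEY);
CREATE TABLE race       (race_id       INT PRIMARY KEY,
                         event_id      INT REFERENCES event(event_id));
CREATE TABLE team       (team_id       INT PRIMARY KEY);
CREATE TABLE competitor (competitor_id INT PRIMARY KEY,
                         team_id       INT REFERENCES team(team_id));
CREATE TABLE race_entry (race_id       INT REFERENCES race(race_id),
                         competitor_id INT REFERENCES competitor(competitor_id),
                         PRIMARY KEY (race_id, competitor_id));

-- Which events does team 7 participate in?
SELECT DISTINCT r.event_id
FROM   race_entry e
JOIN   race r       ON r.race_id = e.race_id
JOIN   competitor c ON c.competitor_id = e.competitor_id
WHERE  c.team_id = 7;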

 Reconsideration of using NR model for OO concepts


We have seen that the nested relational model arises naturally from the object-relational
model, and that it has a sound and complete query language based on first normal form.

However, we have seen several practical problems:

 Using the NR model forces the designer to make more choices at the database
schema level than if the standard relational model is used.
 A query on a NR model must include navigation paths.
 A query must often unnest complex structures, often very deeply for even
semantically simple queries.

So even though the nested relational model is sound, it is very much more difficult to use
than the standard relational model, so may be thought of as much more expensive to use.

In order for a more expensive tool to be a sound engineering choice, there must be a
corresponding benefit. Let us therefore look at the benefits of the object-oriented
programming model.

OO programming and design originated in the software engineering domain. In this
domain, it is considered beneficial to hide the details of the implementation of a program
specification. This information hiding makes use of objects more transparent, and ensures
that modifications made to objects, which do not affect functionality, may be made
without side effects. The principles of information hiding were a major advance in
software engineering.


The benefits of using an OO approach in a database therefore would come from
information hiding, that is hiding implementation details not required for understanding
the specification of an object.

Let us see how this applies to the specification of data in an information system. As we
have seen, it is common to use a conceptual modelling technique to specify such data.
The implementation of this data is ultimately in terms of disk
addresses, file organisations and access methods, but is generally done in several stages.

The first stage of implementation is normally the specification of schemas in a database
data description language, very often in a relational database system. This stage of
implementation is almost a transliteration, frequently introducing no additional design
decisions. Algorithms for the purpose are given for example by Elmasri and Navathe
(2000).

Further stages of implementation are performed almost entirely within the database
manager software (DBMS), sometimes with the guidance of a database administrator
who will identify attributes of tables which need rapid access, or give the DBMS some
parameters which it will use to choose among pre-programmed design options. In effect,
the implementation of the data model is almost entirely automated, and generally not the
concern of the applications programmer.

So the conceptual data model is a specification, the almost equivalent DBMS table
schemas are in effect also specifications, and the programmer does not generally proceed
further with refinement.

On the programming side, an information system generally has a large number of
modules which update or query the tables. In a relational system, these programs are
generally written using the SQL data manipulation language.

The SQL statement is at a very high level, and is generally also refined in several stages:

 The order of execution of the various relational operators must be chosen.


 Various secondary and primary indexes can be created or employed
 Decisions need to be made as to the size of blocks retrieved from disk, what is to
be cached in main memory, whether intermediate results need to be sorted, and
what sort algorithms to use.

But, again, these refinement decisions are made by the DBMS using pre-programmed
design decisions depending on statistics of the tables held in the system catalog and to a
degree on parameters supplied by the database administrator. The programmer is
generally not concerned with them.

So it makes sense to think of an SQL statement not as a program but as a specification for
a program. It is hard to see what might be removed from an SQL statement while


retaining the same specified result. The SELECT clause determines which columns are to
appear in the result, the FROM clause determines which tables to retrieve data from (in
effect which entities and relationships the data is to come from), and the WHERE clause
determines which rows to retrieve data from.

We have seen that the benefit of information hiding in object-oriented design is that the
programmer can work with the specifications of the data and methods of a system
without having to worry about how the specifications are implemented. However, in
information systems, the programmer works only with specifications of data structures
and access/ update methods. The implementation is hidden already in the DBMS. So in a
DBMS environment the programmer never has to worry how the specifications are
implemented. Information hiding is already employed no matter what design method the
programmer uses.

What the nested relational data model does is hide aspects of the structure of the specified
data, whereas the standard relational model exposes the specified structure of the data.

Using the NR data model, the data designer must make what amount to packaging design
decisions in the implementation of a conceptual model. In this sense, a NR model is more
refined than a standard relational model, and is therefore more expensive to build. On the
other hand, when a query is planned, in the NR model the programmer, besides
specifying the data that is to appear in the query, must also specify how to unpackage the
data to expose sufficient structure to specify the result. So as we have seen, the query is
also more expensive. Both the data representation and the query are unnecessarily more
expensive than the standard relational representation, since the information being hidden
is part of the specification, not how the specifications are implemented.

 So why don’t people use RDBs for OO applications?

One might ask why people don’t already use relational databases for problems calling for
object-oriented approaches. The usual reason given is that RDBs are too slow. The
paradigmatic object-oriented application is system design, say a VLSI design or the
design of a large software system. There is often only one (very complex) object in the
system. This object has many parts, which are themselves complex. A relational
implementation therefore calls for many subordinate tables with limited context; and
processing data in the application generally requires large numbers of joins.

Database managers tend to be designed to support transactional applications, where there
are a large number of objects of limited complexity. The space of pre-programmed design
options for the implementation of data structures and queries does not generally extend to
the situation where there are a small number of very complex objects.

Rejection of the standard relational data model for these applications is therefore not a
rejection of the model per se, but a recognition that current implementations of the
standard relational data model do not perform well enough for these problems.


 What can be done?

Two problems have been identified which make the standard relational model difficult to
use for OO applications: the slowness of the implementation and the necessity for the
definition of a large number of tables with limited context.
The former problem is technical. A large amount of investment has been made in the
design of implementations for transaction-oriented applications. Given sufficient
effective demand, there is no reason why a sufficient investment can not be made for
applications of the OO type. In particular, there are already relational database systems
optimised around storage of data primarily in main memory rather than on disk. For
example, a research project of National Research Institute for Mathematics and Computer
Science in the Netherlands together with the Free University of Amsterdam, called
Monet, has published a number of papers on the various design issues in this area. A
search on the Web identifies many such products. The problem of slowness of standard
relational implementations for OO applications can be taken to be on the way to solution.

The latter problem, that the data definition for an OO application requires a large number
of tables with limited context, is a problem with the expressiveness of the standard
relational data model. In an OO application one frequently wants to navigate the complex
data structures specified. One might want the set of teams participating in a particular
race in a particular event, or the set of events in
which a particular competitor from a particular team is competing, or the association
between teams and events defined by the many-to-many relationship between Race and
Competitor. From the point of view of each of those queries, there is a nested-relational
packaging of the conceptual model which makes the query simple, simpler than the
standard relational representation. The unsuitablity of the NR model is that these NR
packagings are all different, and that a query not following the chosen packaging
structure is very complex.

However, we have already seen that the primary representation of the data can be in a
conceptual model. The relational representation can be, and generally is, constructed
algorithmically. If the DBMS creates the relational representation of the conceptual
model, then the conceptual model should be the basis for the query language. A query
expressed on the conceptual model can be translated into SQL DML in the same sort of
way that the model itself is translated into SQL DDL. In fact, there are a number of
conceptual query languages which permit the programmer to construct a query by
specifying a navigation through the conceptual model, for example ConQuer (Bloesch
and Halpin, 1996, 1997).
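
As an illustration of the idea (the path notation below is informal pseudocode, not actual
ConQuer syntax, and the table names are assumed), a conceptual navigation such as
"Student enrolled in Course taught by Lecturer 'Halpin'" could be translated into ordinary
SQL joins:

-- Conceptual path (informal):
--   Student -- enrolled in --> Course -- taught by --> Lecturer 'Halpin'

-- A possible generated standard SQL translation:
SELECT DISTINCT s.student_name
FROM   student s
JOIN   enrolment e ON e.student_id  = s.student_id
JOIN   course c    ON c.course_id   = e.course_id
JOIN   teaches t   ON t.course_id   = c.course_id
JOIN   lecturer l  ON l.lecturer_id = t.lecturer_id
WHERE  l.lecturer_name = 'Halpin';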

Using a language like ConQuer, the programmer can specify a navigation path through
the conceptual model, which when it traverses a one-to-many relationship opens the set
of instances on the target side. When it traverses a many-to-many relationship, the view
from the source of the path looks like a one-to-many. Such a traversal of the conceptual
model provides a sort of virtual nested-relational data packaging, which can be translated
into standard SQL without the programmer being aware of exactly how the data is


packaged. This approach therefore is more true to the spirit of object-oriented software
development since the implementation of the specification is completely hidden.

1.2.14 Conclusion

The standard relational data model, where the DDL and DML are both hidden beneath a
conceptual data modelling language and the DBMS is a main-memory implementation
optimised for OO-style applications, presents a much superior approach to the problem of
OO applications than does the nested relational data model.

1.3. Revision Points

 Information is represented in an object-oriented database in the form of objects, as
used in Object-Oriented Programming.
 A database is a logical term used to refer to a collection of organized and related
information.
 Operator polymorphism: It refers to an operation’s ability to be applied to
different types of objects; in such a situation, an operation name may refer to
several distinct implementations, depending on the type of objects it is applied to.
This feature is also called operator overloading.
 ODMG standard refers to - object model, object definition language (ODL),
object query language (OQL), and bindings to object-oriented programming
languages.
 OQL Collection Operators include Aggregate operators such as: min, max,
count, sum, and avg.
 Structured Query Language is a high level language that was developed to
provide access to the data contained in relational databases. SQL has been widely
adopted and now almost all contemporary databases can be accessed using SQL.

1.4. Intext Questions

1. Illustrate ODMG
2. What is C++ Language Binding ?
3. Explain what is the concept of Object Oriented Databases ?
4. Define Object Definition Language .
5. Write a note on Object Query Language.
6. The usage of CORBA in Database management – Discuss.
7. Explain Entity Relationship Diagram ?


1.5. Summary

o The term "object-oriented database system" first appeared around 1985.


o OO databases try to maintain a direct correspondence between real-world and
database objects so that objects do not lose their integrity and identity and can
easily be identified and operated upon
o The three most basic constructors are atom, tuple, and set. Other commonly used
constructors include list, bag, and array. The atom constructor is used to represent
all basic atomic values, such as integers, real numbers, character strings,
Booleans, and any other basic data types that the system supports directly.
o Extents: In most OO databases, the collection of objects in an extent has the same
type or class; since the majority of OO databases support types, we assume that
extents are collections of objects of the same type
o Persistent Collection: It holds a collection of objects that is stored permanently in
the database and hence can be accessed and shared by multiple programs
o Transient Collection: It exists temporarily during the execution of a program but
is not kept when the program terminates
o We made the ODMG object model much more comprehensive, added a
metaobject interface, defined an object interchange format, and worked to make
the programming language bindings consistent with the common model. We
made changes throughout the specification based on several years of experience
implementing the standard in object database products.
o The goal of this Object Definition Language (ODL) is to capture enough
information to be able to generate the majority of most SMB web apps directly
from a set of statements in the language . . .
o The C++ binding to ODBMSs includes a version of the ODL that uses C++
syntax, a mechanism to invoke OQL, and procedures for operations on databases
and transactions

1.6. Terminal Exercise

1. What is Database?
2. Define ODL, OQL?
3. What is Polymorphism?
4. What do you mean by OOAD?
5. What is the main use of CORBA?

1.7. Suggested Reading

1. Bloesch, A. and Halpin, T. (1996) “ConQuer: a Conceptual Query Language”, Proc.
ER’96: 15th International Conference on Conceptual Modeling, Springer LNCS,
no. 1157, pp. 121-33.
2. Bloesch, A. and Halpin, T. (1997) “Conceptual Queries Using ConQuer-II”, in
David W. Embley, Robert C. Goldstein (Eds.): Conceptual Modeling - ER '97,
16th International Conference on Conceptual Modeling, Los Angeles, California,
USA, November 3-5, 1997, Proceedings. Lecture Notes in Computer Science
1331, Springer, 1997.
3. Elmasri, R. & Navathe, S. B. (2000). Fundamentals of Database Systems (3rd ed.).
Addison Wesley, Reading, Mass.
4. Stonebraker, M., Brown, P. and Moore, D. (1999) Object-relational DBMSs:
Tracking the Next Great Wave. San Francisco, Calif.: Morgan Kaufmann Publishers.

1.8 Assignments

1. By using C++ write the ODL statements to fetch the data from the Inventory
database.

1.9 Reference Books

Ramez Elmasri, Shamkant B. Navathe, “Fundamentals of Database Systems”,
Addison-Wesley, 2000.

1.10 Learning Activities

1. Discuss in detail about SQL.


2. The usage of CORBA in Database management – Discuss.

1.11 Keywords
1. Object-Oriented Database
2. ORDBMS - Object Relational Database Management System
3. ODMG – Object Database Management Group.
4. ODL – Object Definition Language
5. OQL – Object Query Language.



UNIT- II

Topics:
 Functional Dependencies & Normalization For Relational Database
 Normal Forms Based on Primary Keys
 General Definitions of Second and Third Normal Forms
 Boyce-Codd Normal Form
 Algorithms for Relational Database Schema Design
 Multivalued Dependencies and Fourth Normal Form
 Join Dependencies and Fifth Normal Form
 The Database Design Process

2.0 Introduction

E. F. Codd, in the early 1970s, using relational mathematics, devised a system where
tables can be designed in such a way that certain "anomalies" can be eliminated by the
selection of which columns (attributes) are included in the table. Since relational
mathematics is based upon "relations", it is assumed that all tables in this discussion
satisfy the assumptions incorporated in a relation, mentioned earlier. The widespread use
of the relational database model is a fairly recent phenomenon because the operation of
joining tables requires considerable computer resources and it is only in recent years that
computer hardware is such that large relational databases can be satisfactorily
maintained.

Suppose we are now given the task of designing and creating a database. Good database
design, needless to say, is important. Careless design can lead to uncontrolled data
redundancies that will lead to problems with data anomalies.

In this chapter we will examine a process known as Normalisation - a rigorous design
tool that is based on the mathematical theory of relations which will result in very
practical operational implementations. A properly normalised set of relations actually
simplifies the retrieval and maintenance processes and the effort spent in ensuring good
structures is certainly a worthwhile investment. Furthermore, if database relations were
simply seen as file structures of some vague file system, then the power and flexibility of
RDBMS cannot be exploited to the full.

2.1 Objective

a. A Bad Design

E.Codd has identified certain structural features in a relation which create retrieval and
update problems. Suppose we start off with a relation with a structure and details like:


Simple structure

This is a simple and straightforward design. It consists of one relation where we have a
single tuple for every customer and under that customer we keep all his transaction
records about parts, up to a possible maximum of 9 transactions. For every new
transaction, we need not repeat the customer details (of name, city and telephone), we
simply add on a transaction detail.
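
A sketch of this wide design as SQL DDL (the quantity and date columns for each
transaction slot are assumed, and # is written as "No" since it is not legal in identifiers):

CREATE TABLE customer_transactions (
    CNo    INT PRIMARY KEY,                -- C#
    Cname  VARCHAR(40),
    Ccity  VARCHAR(40),
    Cphone VARCHAR(20),
    P1No   INT, P1Qty INT, P1Date DATE,    -- transaction slot 1
    P2No   INT, P2Qty INT, P2Date DATE,    -- transaction slot 2
    -- ... slots 3 to 8 repeated in the same way ...
    P9No   INT, P9Qty INT, P9Date DATE     -- transaction slot 9
);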

Note the following disadvantages:

 The relation is wide and clumsy


 We have set a limit of 9 (or whatever reasonable value) transactions per customer.
What if a customer has more than 9 transactions?
 For customers with less than 9 transactions, it appears that we have to store null
values in the remaining spaces. What a waste of space!
 The transactions appear to be kept in ascending order of P#s. What if we have to
delete, for customer Codd, the part numbered 1- should we move the part
numbered 2 up (or rather, left)? If we did, what if we decide later to re-insert part
2? The additions and deletions can cause awkward data shuffling.

Let us try to construct a query to "Find which customer(s) bought P# 2" ? The query
would have to access every customer tuple and for each tuple, examine every of its
transaction looking for

(P1# = 2) OR (P2# = 2) OR (P3# = 2) ... OR (P9# = 2)

A comparatively simple query seems to require a clumsy retrieval formulation!
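
As a concrete illustration, here is roughly what that retrieval would look like in SQL against the wide design. The identifiers are assumptions made only for this sketch (CUSTOMER_WIDE for the relation, CNAME for Cname, and PNUM1 ... PNUM9 standing in for P1# ... P9#, since # cannot appear in SQL names):

-- "Find which customer(s) bought P# 2" against the wide, unnormalised design:
-- every one of the nine part columns has to be tested explicitly.
SELECT CNAME
FROM   CUSTOMER_WIDE
WHERE  PNUM1 = 2 OR PNUM2 = 2 OR PNUM3 = 2
    OR PNUM4 = 2 OR PNUM5 = 2 OR PNUM6 = 2
    OR PNUM7 = 2 OR PNUM8 = 2 OR PNUM9 = 2;

The restructured design discussed in the next subsection reduces this to a single comparison.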


b. Another Bad Design

Alternatively, why don't we re-structure our relation such that we do not restrict the
number of transactions per customer. We can do this with the following structure:


This way, a customer can have just any number of Part transactions without worrying
about any upper limit or wasted space through null values (as it was with the previous
structure).

Constructing a query to "Find which customer(s) bought P# 2" is not as cumbersome as


before as one can now simply state: P# = 2.

But again, this structure is not without its faults:

 It seems a waste of storage to keep repeated values of Cname, Ccity and Cphone.
 If C# 1 were to change his telephone number, we would have to ensure that we
update ALL occurrences of C# 1's Cphone values. This means updating tuple 1,
tuple 2 and all other tuples where there is an occurrence of C# 1. Otherwise, our
database would be left in an inconsistent state.
 Suppose we now have a new customer with C# 4. However, there is no part transaction yet with the customer as he has not ordered anything yet. We may find that we cannot insert this new information because we do not have a P# which serves as part of the 'primary key' of a tuple.
 Suppose the third transaction has been canceled, i.e. we no longer need information about 25 of P# 1 being ordered on 26 Jan. We thus delete the third tuple. We are then left with the following relation:


But then, suppose we need information about the customer "Martin", say the city he is
located in. Unfortunately as information about Martin was held in only that tuple and
having the entire tuple deleted because of its P# transaction, meant also that we have lost
all information about Martin from the relation.


As illustrated in the above instances, we note that badly designed, unnormalised relations
waste storage space. Worse, they give rise to the following storage irregularities:

 Update anomaly: Data inconsistency or loss of data integrity can arise from data
redundancy/repetition and partial update.
 Insertion anomaly: Data cannot be added because some other data is absent.
 Deletion anomaly: Data may be unintentionally lost through the deletion of other data.

2.2 Content

2.2.1 Functional Dependencies & Normalization For Relational Databases

Informal design guidelines for relational schemas

Intuitively, it would seem that these undesirable features can be removed by breaking a
relation into other relations with desirable structures. We shall attempt by splitting the
above Transaction relation into the following two relations, Customer and Transaction,
which can be viewed as entities with a one to many relationship.

Figure 4-2: 1:M data relationships

Let us see if this new design will alleviate the above storage anomalies:

a. Update anomaly

If C# 1 were to change his telephone number, as there is only one occurrence of the tuple
in the Customer relation, we need to update only that one tuple as there are no
redundant/duplicate tuples.

b. Addition anomaly

Adding a new customer with C# 4 can be easily done in the Customer relation of which
C# serves as the primary key. With no P# yet, a tuple in Transaction need not be created.


c. Deletion anomaly

Canceling the third transaction about 25 of P# 1 being ordered on 26 Jan would now
mean deleting only the third tuple of the new Transaction relation above. This leaves
information about Martin still intact in the new Customer relation.
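
A minimal SQL sketch of this decomposition is given below. The table and column names (CUSTOMER, TRANS, CNUM for C#, PNUM for P#, and so on), the data types, and the choice of primary keys are illustrative assumptions; the text does not fix them:

CREATE TABLE CUSTOMER (
    CNUM    INTEGER PRIMARY KEY,      -- C#: one tuple per customer, so a phone
    CNAME   VARCHAR(30),              -- change touches exactly one row
    CCITY   VARCHAR(30),
    CPHONE  VARCHAR(15)
);

CREATE TABLE TRANS (                  -- "TRANSACTION" is reserved in some SQL dialects
    CNUM       INTEGER NOT NULL REFERENCES CUSTOMER(CNUM),
    PNUM       INTEGER NOT NULL,      -- P#
    QTY        INTEGER,
    ORDER_DATE DATE,
    PRIMARY KEY (CNUM, PNUM, ORDER_DATE)   -- key assumed for this sketch
);

With this split, a new customer is inserted into CUSTOMER without needing any TRANS row, and deleting a single TRANS row never removes the customer's details.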

This process of reducing a relation into simpler structures is the process of Normalisation.

Normalisation may be defined as a step by step reversible process of transforming an unnormalised relation into relations with progressively simpler structures. Since the process is reversible, no information is lost in the transformation.

Normalisation removes (or more accurately, minimises) the undesirable properties by working through a series of stages called Normal Forms. Originally, Codd defined three types of undesirable properties:

 Data aggregates
 Partial key dependency
 Indirect key dependency

and the stages of normalisation that remove the associated problems are defined below.

We shall now show a more formal process on how we can decompose relations into
multiple relations by using the Normal Form rules for structuring.

According to (Elmasri & Navathe, 1994), the normalization process, as first proposed by
Codd (1972), takes a relation schema through a series of tests to "certify" whether or not
it belongs to a certain normal form. Initially, Codd proposed three normal forms, which
he called first, second, and third normal form. A stronger definition of 3NF was
proposed later by Boyce and Codd and is known as Boyce-Codd normal form (BCNF).
All these normal forms are based on the functional dependencies among the attributes of
a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were
proposed, based on the concepts of multivalued dependencies and join dependencies,
respectively.

Functional Dependencies

Normalization of data can be looked on as a process during which unsatisfactory relation schemas are decomposed by breaking up their attributes into smaller relation schemas that possess desirable properties. One objective of the original normalization process is to ensure that the update anomalies do not occur.


Normal forms provide database designers with:

 A formal framework for analyzing relation schemas based on their keys and on
the functional dependencies among their attributes.
 A series of tests that can be carried out on individual relation schema so that the
relational database can be normalized to any degree. When a test fails, the
relation violating that test must be decomposed into relations that individually
meet the normalization tests.

Normal forms, when considered in isolation from other factors, do not guarantee a good
database design. It is generally not sufficient to check separately that each relation
schema in the database is, say, in BCNF or 3NF. Rather, the process of normalization
through decomposition must also confirm the existence of additional properties that the
relational schemas, taken together, should possess. Two of these properties are:

 The lossless join or nonadditive join property, which guarantees that the spurious
tuple problem does not occur.
 The dependency preservation property, which ensures that all functional
dependencies are represented in some of the individual resulting relations.

Let's begin by creating a sample set of data. Imagine we are working on a system to keep
track of employees working on certain projects.

Project number   Project name             Employee number   Employee name      Rate category   Hourly rate
1023             Madagascar travel site   11                Vincent Radebe     A               $60
                                          12                Pauline James      B               $50
                                          16                Charles Ramoraz    C               $40
1056             Online estate agency     11                Vincent Radebe     A               $60
                                          17                Monique Williams   B               $50

A problem with the above data should immediately be obvious. Tables in relational
databases, which would include most databases you'll work with, are in a simple grid, or
table format. Here, each project has a set of employees. So we couldn't even enter the
data into this kind of table. And if we tried to use null fields to cater for the fields that
have no value, then we cannot use the project number, or any other field, as a primary
key (a primary key is a field, or list of fields, that uniquely identify one record). There is
not much use in having a table if we can't uniquely identify each record in it.


So, our solution is to make sure that each field has no sets, or repeating groups.
Now we can place the data in a table.

employee_project table

Project number   Project name             Employee number   Employee name      Rate category   Hourly rate
1023             Madagascar travel site   11                Vincent Radebe     A               $60
1023             Madagascar travel site   12                Pauline James      B               $50
1023             Madagascar travel site   16                Charles Ramoraz    C               $40
1056             Online estate agency     11                Vincent Radebe     A               $60
1056             Online estate agency     17                Monique Williams   B               $50

Notice that the project number cannot be a primary key on its own. It does not uniquely identify a row of data. So, our primary key must be a combination of project number and employee number. Together these two fields uniquely identify one row of data. (Think about it. You would never add the same employee more than once to a project. If for some reason this could occur, you'd need to add something else to the key to make it unique.) So, now our data can go in table format, but there are still some problems with it. We store the information that code 1023 refers to the Madagascar travel site 3 times! Besides the waste of space, there is another serious problem. Look carefully at the data below.

employee_project table

Project number   Project name             Employee number   Employee name      Rate category   Hourly rate
1023             Madagascar travel site   11                Vincent Radebe     A               $60
1023             Madagascar travel site   12                Pauline James      B               $50
1023             Madagascat travel site   16                Charles Ramoraz    C               $40
1056             Online estate agency     11                Vincent Radebe     A               $60
1056             Online estate agency     17                Monique Williams   B               $50


Did you notice anything strange in the data above? Congratulations if you did!
Madagascar is misspelt in the 3rd record. Now imagine trying to spot this error in a table
with thousands of records! By using the structure above, the chances of the data being
corrupted increases drastically.

The solution is simply to take out the duplication. What we are doing formally is looking
for partial dependencies, i.e. fields that are dependent on a part of a key, and not the entire
key. Since both project number and employee number make up the key, we look for
fields that are dependent only on project number, or on employee number.

We identify two fields. Project name is dependent on project number only


(employee_number is irrelevant in determining project name), and the same applies to
employee name, hourly rate and rate category, which are dependent on employee
number. So, we take out these fields, as follows:

employee_project table

Project number   Employee number
1023             11
1023             12
1023             16
1056             11
1056             17

Clearly we can't simply take out the data and leave it out of our database. We take it out,
and put it into a new table, consisting of the field that has the partial dependency, and the
field it is dependent on. So, we identified employee name, hourly rate and rate category
as being dependent on employee number.

The new table will consist of employee number as a key, and employee name, rate
category and hourly rate, as follows:

Employee table
Employee number Employee name Rate category Hourly rate
11 Vincent Radebe A $60
12 Pauline James B $50
16 Charles Ramoraz C $40
17 Monique Williams B $50

And the same for the project data.


Project table
Project number Project name
1023 Madagascar travel site
1056 Online estate agency

Note the reduction of duplication. The text "Madagascar travel site" is stored once only, not for each occurrence of an employee working on that project. The link is made through the key, the project number. Obviously there is no way to remove the duplication of this number without losing the relation altogether, but it is far more efficient to store a short number repeatedly than a large piece of text.
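
Nothing is lost by the split, because the original listing can be rebuilt with a join through that key. A minimal sketch in SQL, assuming the two relations are implemented as tables named project(project_number, project_name) and employee_project(project_number, employee_number); these names are assumptions, not fixed by the text:

-- Rebuild the project/employee listing from the two smaller tables.
SELECT p.project_number,
       p.project_name,
       ep.employee_number
FROM   employee_project AS ep
JOIN   project          AS p ON p.project_number = ep.project_number
ORDER BY p.project_number, ep.employee_number;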

We're still not perfect. There is still room for anomalies in the data. Look carefully at the
data below.

Employee table

Employee number Employee name Rate category Hourly rate


11 Vincent Radebe A $60
12 Pauline James B $50
16 Charles Ramoraz C $40
17 Monique Williams B $40

The problem above is that Monique Williams has been awarded an hourly rate of $40,
when she is actually category B, and should be earning $50 (In the case of this company,
the rate category - hourly rate relationship is fixed. This may not always be the case).
Once again we are storing data redundantly: the hourly rate - rate category relationship is
being stored in its entirety for each employee. The solution, as before, is to remove this
excess data into its own table. Formally, what we are doing is looking for transitive
relationships, or relationships where a non-key attribute is dependent on another non-key
relationship. Hourly rate, while being in one sense dependent on Employee number (we
probably identified this dependency earlier, when looking for partial dependencies) is
actually dependent on Rate category. So, we remove it, and place it in a new table, with

its actual key, as follows.

Employee table

Employee number Employee name Rate category


11 Vincent Radebe A
12 Pauline James B
16 Charles Ramoraz C
17 Monique Williams B


Rate table

Rate category Hourly rate


A $60
B $50
C $40

We've cut down once again. It is now impossible to mistakenly assume rate category "B" is associated with an hourly rate of anything but $50. These relationships are now stored in only one place - our new table, where it can be ensured they are accurate.
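
When an employee's hourly rate is needed, it is recovered through the rate category rather than stored with every employee. A sketch, assuming the tables are implemented as employee(employee_number, employee_name, rate_category) and rate(rate_category, hourly_rate); the names are assumptions:

SELECT e.employee_number,
       e.employee_name,
       e.rate_category,
       r.hourly_rate          -- stored once per category, in the rate table
FROM   employee AS e
JOIN   rate     AS r ON r.rate_category = e.rate_category;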

a. Modification Anomalies

 Once our E-R model has been converted into relations, we may find that some
relations are not properly specified. There can be a number of problems:
o Deletion Anomaly: Deleting a relation results in some related information
(from another entity) being lost.
o Insertion Anomaly: Inserting a relation requires we have information
from two or more entities - this situation might not be feasible.
 Here is a quick example: A company has a Purchase order form:

Our dutiful consultant creates the E-R Model:



LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)


PO_HEADER (PO_Number, PODate, Vendor, Ship_To, ...)

Consider some sample data for the LINE_ITEMS relation:

PO_Number   ItemNum   PartNum   Description   Price   Qty
O101        I01       P99       Plate         $3.00   7
O101        I02       P98       Cup           $1.00   11
O101        I03       P77       Bowl          $2.00   6
O102        I01       P99       Plate         $3.00   5
O102        I02       P77       Bowl          $2.00   5
O103        I01       P33       Fork          $2.50   8

 What are some of the problems with this relation ?


 What happens when we delete item 2 from Order O101 ?
 These problems occur because the relation in question contains data about 2 or
more themes.
 Typical way to solve these anomalies is to split the relation into two or more relations - a process called Normalization.


Normal Forms based on Primary keys

The normal forms defined in relational database theory represent guidelines for record
design. The guidelines corresponding to first through fifth normal forms are presented
here, in terms that do not require an understanding of relational theory. The design
guidelines are meaningful even if one is not using a relational database system. We
present the guidelines without referring to the concepts of the relational model in order to
emphasize their generality, and also to make them easier to understand. Our presentation
conveys an intuitive sense of the intended constraints on record design, although in its
informality it may be imprecise in some technical details. A comprehensive treatment of
the subject is provided by Date.

The normalization rules are designed to prevent update anomalies and data
inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward
the assumption that all non-key fields will be updated frequently. They tend to penalize
retrieval, since data which may have been retrievable from one record in an unnormalized
design may have to be retrieved from several records in the normalized form. There is no
obligation to fully normalize all records when actual performance requirements are taken
into account.

a. First Normal Form

First normal form is now considered to be part of the formal definition of a relation;
historically, it was defined to disallow multivalued attributes, composite attributes, and
their combinations. It states that the domains of attributes must include only atomic
(simple, indivisible) values and that the value of any attribute in a tuple must be a single
value from the domain of that attribute.

Practical Rule: "Eliminate Repeating Groups," i.e., make a separate table for each set of
related attributes, and give each table a primary key.

Formal Definition: A relation is in first normal form (1NF) if and only if all underlying
simple domains contain atomic values only.

First normal form deals with the "shape" of a record type.

ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
Under first normal form, all occurrences of a record type must contain the same number
of fields.

First normal form excludes variable repeating fields and groups. This is not so much a
design guideline as a matter of definition. Relational database theory doesn't deal with
records having a variable number of fields.


Example 1:

Let's run again through the example we've just done, this time without the data tables to
guide us. After all, when you're designing a system, you usually won't have test data
available at this stage. The tables were there to show you the consequences of storing
data in unnormalized tables, but without them we can focus on dependency issues, which
is the key to database normalization.

In the beginning, the data structure we had was as follows:

Project number
Project name
1-n Employee numbers (1-n indicates that there are many occurrences of this field - it is a
repeating group)
1-n Employee names
1-n Rate categories
1-n Hourly rates

So, to begin the normalization process, we start by moving from zero normal form to 1st
normal form.

The definition of 1st normal form


There are no repeating groups
All the key attributes are defined
All attributes are dependent on the primary key

So far, we have no keys, and there are repeating groups. So we remove the repeating
groups, and define the primary key, and are left with the following:

Employee project table

Project number - primary key


Project name
Employee number - primary key
Employee name
Rate category
Hourly rate

This table is in 1st normal form.

Example 2:

Consider this example


No repeating groups. As an example, it might be tempting to make an invoice table with columns for the first, second, and third line item (see above). This violates the first normal form, and would result in large rows, wasted space (where an invoice had less than the maximum number of line items), and *horrible* SQL statements with a separate join for each repetition of the column. First normal form requires you to make a separate line item table, with its own key (in this case the combination of invoice number and line number) (see below).

To conclude a relation is in first normal form if it meets the definition of a relation:

1. Each column (attribute) value must be a single value only.


2. All values for a given column (attribute) must be of the same type.
3. Each column (attribute) name must be unique.
4. The order of columns is insignificant.
5. No two rows (tuples) in a relation can be identical.


6. The order of the rows (tuples) is insignificant.


 If you have a key defined for the relation, then you can meet the unique row
requirement.
 Example relation in 1NF:
STOCKS (Company, Symbol, Date, Close_Price)

Company Symbol Date Close Price


IBM IBM 01/05/94 101.00
IBM IBM 01/06/94 100.50
IBM IBM 01/07/94 102.00
Netscape NETS 01/05/94 33.00
Netscape NETS 01/06/94 112.00
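
A table definition consistent with this 1NF relation might look as follows. The data types, the renaming of Date to Trade_Date (DATE is a reserved word in many dialects), and the choice of (Symbol, Trade_Date) as the primary key are assumptions made for this sketch:

CREATE TABLE STOCKS (
    Company     VARCHAR(40)  NOT NULL,
    Symbol      CHAR(6)      NOT NULL,
    Trade_Date  DATE         NOT NULL,
    Close_Price DECIMAL(10,2),
    PRIMARY KEY (Symbol, Trade_Date)   -- every column atomic, no two rows identical
);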

We deal now only with "single-valued" facts. The fact could be a one-to-many
relationship, such as the department of an employee, or a one-to-one relationship, such as
the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one
or one-to-many relationship between Y and X. In the general case, Y might consist of one
or more fields, and so might X. In the following example, QUANTITY is a fact about the
combination of PART and WAREHOUSE.

b. General Definitions of Second and Third Normal Forms

Second normal form is based on the concept of full functional dependency. A functional dependency X->Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more. A relation schema is in Second Normal Form if every nonprime attribute in the relation is fully functionally dependent on the primary key of the relation. It can also be restated as: a relation schema is in Second Normal Form if no nonprime attribute in the relation is partially dependent on any key of the relation.

Practical Rule: "Eliminate Redundant Data," i.e., if an attribute depends on only part of a
multivalued key, remove it to a separate table.
Formal Definition: A relation is in second normal form (2NF) if and only if it obeys the
conditions of First Normal Form and every nonkey attribute is fully dependent on the
primary key.

To put it simply, a table is in 2nd normal form if:

It's in 1st normal form
It includes no partial dependencies (where an attribute is dependent on only a part of a primary key).


Example 1:

So, we go through all the fields. Considering our example, Project name is only
dependent on Project number. Employee name, Rate category and Hourly rate are
dependent only on Employee number. So we remove them, and place these fields in a
separate table, with the key being that part of the original key they are dependent on. So,
we are left with the following 3 tables:

Employee project table

Project number - primary key


Employee number - primary key

Employee table

Employee number - primary key


Employee name
Rate category
Hourly rate

Project table

Project number - primary key


Project name

The table is now in 2nd normal form.

Example 2:

Consider this example.

As we know, second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------

The key here consists of the PART and WAREHOUSE fields together, but
WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems
with this design are:

 The warehouse address is repeated in every record that refers to a part stored in
that warehouse.


 If the address of the warehouse changes, every record referring to a part stored in
that warehouse must be updated.
 Because of the redundancy, the data might become inconsistent, with different
records showing different addresses for the same warehouse.
 If at some point in time there are no parts stored in the warehouse, there may be
no record in which to keep the warehouse's address.

To satisfy second normal form, the record shown above should be decomposed into
(replaced by) the two records:

-------------------------------  ---------------------------------
| PART | WAREHOUSE | QUANTITY |  | WAREHOUSE | WAREHOUSE-ADDRESS |
====================-----------  =============--------------------

When a data design is changed in this way, replacing unnormalized records with
normalized records, the process is referred to as normalization. The term "normalization"
is sometimes used relative to a particular normal form. Thus a set of records may be
normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and
inconsistency, but at some possible performance cost for certain retrieval applications.
Consider an application that wants the addresses of all warehouses stocking a certain part.
In the unnormalized form, the application searches one record type. With the normalized
design, the application has to search two record types, and connect the appropriate pairs.
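
That extra work amounts to one join. A sketch, assuming the two normalized records are stored as tables stock_level(part, warehouse, quantity) and warehouse(warehouse, warehouse_address); the table names and the sample part value are assumptions:

-- Addresses of all warehouses stocking a given part.
SELECT w.warehouse, w.warehouse_address
FROM   stock_level AS s
JOIN   warehouse   AS w ON w.warehouse = s.warehouse
WHERE  s.part = 'P1';           -- 'P1' is a placeholder part number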

To summarize,

 A relation is in second normal form (2NF) if all of its non-key attributes are
dependent on all of the key.
 Relations that have a single attribute for a key are automatically in 2NF.
 This is one reason why we often use artificial identifiers as keys.
 In the example below, Close Price is dependent on Company, Date and Symbol,
Date
 The following example relation is not in 2NF:

STOCKS (Company, Symbol, Headquarters, Date, Close_Price)

Company Symbol Headquarters Date Close Price


IBM IBM Armonk, NY 01/05/94 101.00
SONY SONY Armonk, NY 01/06/94 100.50
SONY SONY Armonk, NY 01/07/94 102.00
Netscape NETS Sunnyvale, CA 01/05/94 33.00
Netscape NETS Sunnyvale, CA 01/06/94 112.00


Company, Date -> Close Price


Symbol, Date -> Close Price
Company -> Symbol, Headquarters
Symbol -> Company, Headquarters

Consider that Company, Date -> Close Price.


So we might use Company, Date as our key.
However: Company -> Headquarters
This violates the rule for 2NF. Also, consider the insertion and deletion anomalies.

One Solution: Split this up into two relations:


COMPANY (Company, Symbol, Headquarters)
STOCKS (Symbol, Date, Close_Price)

Company Symbol Headquarters


SONY SONY Armonk, NY
Netscape NETS Sunnyvale, CA
Company -> Symbol, Headquarters
Symbol -> Company, Headquarters

Symbol Date Close Price


SONY 01/05/94 101.00
SONY 01/06/94 100.50
SONY 01/07/94 102.00
NETS 01/05/94 33.00
NETS 01/06/94 112.00

Symbol, Date -> Close Price
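
In SQL, this split might be declared as follows. The data types and constraints are assumptions; the UNIQUE on Company simply records that Company is also a determinant (Company -> Symbol, Headquarters):

CREATE TABLE COMPANY (
    Symbol       CHAR(6)     PRIMARY KEY,
    Company      VARCHAR(40) NOT NULL UNIQUE,
    Headquarters VARCHAR(60)
);

CREATE TABLE STOCKS (
    Symbol      CHAR(6) NOT NULL REFERENCES COMPANY(Symbol),
    Trade_Date  DATE    NOT NULL,          -- Date renamed: reserved word
    Close_Price DECIMAL(10,2),
    PRIMARY KEY (Symbol, Trade_Date)
);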

c. Third Normal Form


Third normal form is based on the concept of transitive dependency. A functional
dependency X->Y in a relation is a transitive dependency if there is a set of attributes Z
that is not a subset of any key of the relation, and both X->Z and Z->Y hold. In other
words, a relation is in 3NF if, whenever a functional dependency X->A holds in the
relation, either (a) X is a superkey of the relation, or (b) A is a prime attribute of the
relation.

Practical Rule: "Eliminate Columns not Dependent on Key," i.e., if attributes do not
contribute to a description of a key, remove them to a separate table.


Formal Definition: A relation is in third normal form (3NF) if and only if it is in 2NF
and every nonkey attribute is nontransitively dependent on the primary key.

To put it simply, the definition of 3rd normal form:

It's in 2nd normal form
It contains no transitive dependencies (where a non-key attribute is dependent on another non-key attribute).

Example 1:

We can narrow our search down to the Employee table, which is the only one with more
than one non-key attribute. Employee name is not dependent on either Rate category or
Hourly rate, the same applies to Rate category, but Hourly rate is dependent on Rate
category. So, as before, we remove it, placing it in its own table, with the attribute it was
dependent on as key, as follows:

Employee project table

Project number - primary key


Employee number - primary key

Employee table

Employee number - primary key


Employee name
Rate Category

Rate table

Rate category - primary key


Hourly rate

Project table

Project number - primary key
Project name

These tables are all now in 3rd normal form, and ready to be implemented.
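
Since the tables are now ready to be implemented, one possible set of SQL definitions is sketched below. The data types and the underscore naming are assumptions; only the table and column structure comes from the discussion above:

CREATE TABLE project (
    project_number  INTEGER PRIMARY KEY,
    project_name    VARCHAR(60)
);

CREATE TABLE rate (
    rate_category   CHAR(1) PRIMARY KEY,
    hourly_rate     DECIMAL(8,2)
);

CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    employee_name   VARCHAR(60),
    rate_category   CHAR(1) REFERENCES rate(rate_category)
);

CREATE TABLE employee_project (
    project_number  INTEGER REFERENCES project(project_number),
    employee_number INTEGER REFERENCES employee(employee_number),
    PRIMARY KEY (project_number, employee_number)
);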

Example 2:

Third normal form is violated when a non-key field is a fact about another non-key field,
as in


------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------

The EMPLOYEE field is the key. If each department is located in one place, then the
LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about
the EMPLOYEE. The problems with this design are the same as those caused by
violations of second normal form:

 The department's location is repeated in the record of every employee assigned to


that department.
 If the location of the department changes, every such record must be updated.
 Because of the redundancy, the data might become inconsistent, with different
records showing different locations for the same department.
 If a department has no employees, there may be no record in which to keep the
department's location.

To satisfy third normal form, the record shown above should be decomposed into the two
records:

-------------------------  -------------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |
============-------------  ==============-----------

To summarize, a record is in third normal form if:

 it is in second normal form and it contains no transitive dependencies.


 Consider relation R containing attributes A, B and C.
If A -> B and B -> C then A -> C

Transitive Dependency: Three attributes with the above dependencies.

Example: At CUNY:

Course_Code -> Course_Num, Section

Course_Num, Section -> Classroom, Professor

Example: At Rutgers:

Course_Index_Num -> Course_Num, Section


Course_Num, Section -> Classroom, Professor


Example:

Company County Tax Rate


IBM Putnam 28%
AT&T Bergen 26%

Company -> County


and
County -> Tax Rate
thus
Company -> Tax Rate

What happens if we remove AT&T?


We lose information about 2 different themes.

Split this up into two relations:

Company County
IBM Putnam
AT&T Bergen

Company -> County

County Tax Rate

Putnam 28%
Bergen 26%

County -> Tax Rate

Before you rush off and start normalizing everything, a word of warning. No process is
better than good old common sense. Take a look at this example.

Customer table

Number - primary key


Name
Address
Zip Code
Town


What normal form is this table in? Giving it a quick glance, we see no repeating groups,
and a primary key defined, so it's at least in 1st normal form. There's only one key, so we
needn't even look for partial dependencies, so it's at least in 2nd normal form. How about
transitive dependencies? Well, it looks like Town might be determined by Zip Code. And
in most parts of the world that's usually the case. So we should remove Town, and place
it in a separate table, with Zip Code as the key? No! Although this table is not technically
in 3rd normal form, removing this information is not worth it.

Creating more tables increases the load slightly, slowing processing down. This is often
counteracted by the reduction in table sizes, and redundant data. But in this case, where
the town would almost always be referenced as part of the address, it isn't worth it.
Perhaps a company that uses the data to produce regular mailing lists of thousands of
customers should normalize fully. It always comes down to how the data is going to be
used. Normalization is just a helpful process that usually results in the most efficient table
structure, and not a rule for database design. But judging from some of the table
structures I've seen out there, it's better to err and normalize than err and not!

Functional dependency - a field Y is functionally dependent on a field (or fields) X if it


is invalid to have two records with the same X value but different Y values. (A given X
must always occur with the same Y.)

 A Functional Dependency describes a relationship between attributes in a single


relation.
 An attribute is functionally dependent on another if we can use the value of one attribute to determine the value of another.

Example: Employee_Name is functionally dependant on Social_Security_Number


because Social_Security_Number can be used to determine the value of
Employee_Name.

We use the symbol -> to indicate a functional dependency.


-> is read functionally determines

Student_ID -> Student_Major


Student_ID, Course#, Semester# -> Grade

SKU -> Compact_Disk_Title, Artist
Model, Options, Tax -> Car_Price
Course_Number, Section -> Professor, Classroom, Number of
Students

The attributes listed on the left hand side of the -> are called determinants.
One can read A -> B as, "A determines B".

i. Keys and Uniqueness

 Key: One or more attributes that uniquely identify a tuple (row) in a relation.


 The selection of keys will depend on the particular application being considered.
 Users can offer some guidance as to what would make an appropriate key. Also
this is pretty much an art as opposed to an exact science.
 Recall that no two tuples (rows) in a relation should have exactly the same values; thus, in the extreme case, a candidate key would consist of all of the attributes in a relation.
 A key functionally determines a tuple (row).

Not all determinants are keys.

Consider this example.

In relational database theory, second and third normal forms are defined in terms of
functional dependencies, which correspond approximately to our single-valued facts. A
field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two
records with the same X-value but different Y-values. That is, a given X-value must
always occur with the same Y-value. When X is a key, then all fields are by definition
functionally dependent on X in a trivial way, since there can't be two records having the
same X value.

There is a slight technical difference between functional dependencies and single-valued


facts as we have presented them. Functional dependencies only exist when the things
involved have unique and singular identifiers (representations). For example, suppose a
person's address is a single-valued fact, i.e., a person has only one address. If we don't
provide unique identifiers for people, then there will not be a functional dependency in
the data:

PERSON       ADDRESS
John Smith   123 Main St., New York
John Smith   321 Center St., San Francisco

Although each person has a unique address, a given name can appear with several
different addresses. Hence we do not have a functional dependency corresponding to our
single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a
functional dependency. In the following case the same person appears to be living at two
different addresses, again precluding a functional dependency.

---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------


We are not defending the use of non-unique or non-singular representations. Such


practices often lead to data maintenance problems of their own. We do wish to point out,
however, that functional dependencies and the various normal forms are really only
defined for situations in which there are unique and singular identifiers. Thus the design
guidelines as we present them are a bit stronger than those implied by the formal
definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued
fact about a non-key field, and hence the design is susceptible to all the update anomalies
mentioned earlier.

----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------

However, in formal terms, there is no functional dependency here between FATHER'S-


ADDRESS and FATHER, and hence no violation of third normal form.

d. Boyce-Codd normal form

Boyce-Codd normal form is stricter than 3NF, meaning that every relation in BCNF is
also in 3NF; however, a relation in 3NF is not necessarily in BCNF. A relation schema is
in BCNF if whenever a functional dependency X->A holds in the relation, then X is a
superkey of the relation. The only difference between BCNF and 3NF is that condition
(b) of 3NF, which allows A to be prime if X is not a superkey, is absent from BCNF.

Formal Definition: A relation is in Boyce/Codd normal form (BCNF) if and only if every
determinant is a candidate key. [A determinant is any attribute on which some other
attribute is (fully) functionally dependent.]

To put it simply, a relation is in Boyce-Codd normal form if every determinant is a candidate key.

Steps in analyzing for BCNF:

(1) Find and list all the candidate keys. (Usually the primary key is known.)
(2) Determine and list all functional dependencies, noting those which are
dependent on attributes which are not the entire primary key.
(3) Determine if any dependencies exist which are based on part but not all of a
candidate key.
(4) Project into relations which remove the problems found in (3).

To summarize,


 A relation is in BCNF if every determinant is a candidate key.


 Recall that not all determinants are keys.
 Those determinants that are keys we initially call candidate keys.
 Eventually, we select a single candidate key to be the primary key for the relation.
 Consider the following example:
Funds consist of one or more Investment Types.
Funds are managed by one or more Managers
Investment Types can have one or more Managers
Managers only manage one type of investment.

FundID InvestmentType Manager


99 Common Stock Smith
99 Municipal Bonds Jones
33 Common Stock Green
22 Growth Stocks Brown
11 Common Stock Smith

FundID, InvestmentType -> Manager


FundID, Manager -> InvestmentType
Manager -> InvestmentType

In this case, the combination FundID and InvestmentType form a candidate key
because we can use FundID,InvestmentType to uniquely identify a tuple in the
relation.

Similarly, the combination FundID and Manager also form a candidate key because
we can use FundID, Manager to uniquely identify a tuple.

Manager by itself is not a candidate key because we cannot use Manager alone to
uniquely identify a tuple in the relation.

Is this relation R(FundID, InvestmentType, Manager) in 1NF, 2NF or 3NF ?
Given we pick FundID, InvestmentType as the Primary Key: 1NF for sure.
2NF because the only non-key attribute (Manager) is dependent on all of the key.
3NF because there are no transitive dependencies.

Consider what happens if we delete the tuple with FundID 22. We lose the fact that Brown manages the InvestmentType "Growth Stocks."

The following are steps to normalize a relation into BCNF:

1. List all of the determinants.


2. See if each determinant can act as a key (candidate keys).


3. For any determinant that is not a candidate key, create a new relation from
the functional dependency. Retain the determinant in the original relation.

For our example:


Rorig(FundID, InvestmentType, Manager)

The determinants are:


FundID, InvestmentType
FundID, Manager
Manager

Which determinants can act as keys ?


FundID, InvestmentType YES
FundID, Manager YES
Manager NO

Create a new relation from the functional dependency:

Rnew(Manager, InvestmentType)
Rorig(FundID, Manager)

In this last step, we have retained the determinant "Manager" in the original relation
Rorig.
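
Expressed in SQL, the two relations of the BCNF decomposition might be declared as follows. Table names, types and sizes are assumptions; the keys simply record the determinants identified above:

CREATE TABLE manager_investment (          -- Rnew(Manager, InvestmentType)
    manager          VARCHAR(40) PRIMARY KEY,   -- a manager manages one type only
    investment_type  VARCHAR(40) NOT NULL
);

CREATE TABLE fund_manager (                -- Rorig(FundID, Manager)
    fund_id  INTEGER     NOT NULL,
    manager  VARCHAR(40) NOT NULL REFERENCES manager_investment(manager),
    PRIMARY KEY (fund_id, manager)
);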

Relational Database Design and Further Dependencies

Fourth and fifth normal forms deal with multi-valued facts. The multi-valued fact may
correspond to a many-to-many relationship, as with employees and skills, or to a many-
to-one relationship, as with the children of an employee (assuming only one parent is an
employee). By "many-to-many" we mean that an employee may have several skills, and a
skill may belong to several employees.

In a sense, fourth and fifth normal forms are also about composite keys. These normal
forms attempt to minimize the number of fields involved in a composite key, as

suggested by the examples to follow.

Multivalued Dependencies and Fourth Normal Form

Multivalued dependencies are a consequence of first normal form, which disallowed an


attribute in a tuple to have a set of values. If we have two or more multivalued
independent attributes in the same relation schema, we get into a problem of having to
repeat every value of one of the attributes with every value of the other attribute to keep
the relation instances consistent.


Fourth normal form is based on multivalued dependencies; it is violated when a relation has undesirable multivalued dependencies, and hence it can be used to identify and decompose such relations. A relation schema R is in 4NF with respect to a set of dependencies F if, for every nontrivial multivalued dependency X->>Y in F, X is a superkey for R.

Practical Rule: "Isolate Independent Multiple Relationships," i.e., no table may contain
two or more 1:n or n:m relationships that are not directly related.

Formal Definition: A relation R is in fourth normal form (4NF) if and only if, whenever
there exists a multivalued dependency in the R, say A->>B, then all attributes of R are
also functionally dependent on A.

Under fourth normal form, a record type should not contain two or more independent
multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills
and several languages. We have here two many-to-many relationships, one between
employees and skills, and one between employees and languages. Under fourth normal
form, these two relationships should not be represented in a single record such as

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Instead, they should be represented in the two records

-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================

Note that other fields, not involving multi-valued facts, are permitted to occur in the
record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.
The main problem with violating fourth normal form is that it leads to uncertainties in the
maintenance policies. Several policies are possible for maintaining two independent
multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|


| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------

This is not much different from maintaining two separate record types. (We note in
passing that such a format also leads to ambiguities regarding the meanings of blank
fields. A blank SKILL could mean the person has no skill, or the field is not applicable to
this employee, or the data is unknown, or, as in this case, the data may be found in
another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------

(b) Minimal number of records, with null values:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------

(c) Unrestricted:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------


(3) A "cross-product" form, where for each employee, there must be a record for every
possible pairing of one of his skills with one of his languages:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------

Other problems caused by violating fourth normal form are similar in spirit to those
mentioned earlier for violations of second or third normal form. They take different
variations depending on the chosen maintenance policy:

 If there are repetitions, then updates have to be done in multiple records, and they
could become inconsistent.
 Insertion of a new skill may involve looking for a record with a blank skill, or
inserting a new record with a possibly blank language, or inserting multiple
records pairing the new skill with some or all of the languages.
 Deletion of a skill may involve blanking out the skill field in one or more records
(perhaps with a check that this doesn't leave two records with the same language
and a blank skill), or deleting one or more records, coupled with a check that the
last mention of some language hasn't also been deleted.

Fourth normal form minimizes such update problems.

a. Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we


mean in terms of the example. The two many-to-many relationships, employee:skill and
employee:language, are "independent" in that there is no direct connection between skills

and languages. There is only an indirect connection because they belong to some
common employee. That is, it does not matter which skill is paired with which language
in a record; the pairing does not convey any information. That's precisely why all the
maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain
languages. Perhaps Smith can cook French cuisine only, but can type in French, German,
and Greek. Then the pairings of skills and languages becomes meaningful, and there is no
longer an ambiguity of maintenance policies. In the present case, only the following form
is correct:


-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------

Thus the employee:skill and employee:language relationships are no longer independent.


These records do not violate fourth normal form. When there is an interdependence
among the relationships, then it is acceptable to represent them in a single record.

b. Multivalued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit
further, we mention that fourth normal form is defined in terms of multivalued
dependencies, which correspond to our independent multi-valued facts. Multivalued
dependencies, in turn, are defined essentially as relationships which accept the "cross-
product" maintenance policy mentioned above. That is, for our example, every one of an
employee's skills must appear paired with every one of his languages. It may or may not
be obvious to the reader that this is equivalent to our notion of independence: since every
possible pairing must be present, there is no "information" in the pairings. Such pairings
convey information only if some of them can be absent, that is, only if it is possible that
some employee cannot perform some skill in some language. If all pairings are always
present, then the relationships are really independent.

We should also point out that multivalued dependencies and fourth normal form apply as
well to relationships involving more than two fields. For example, suppose we extend the
earlier example to include projects, in the following sense:

 An employee uses certain skills on certain projects.


 An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on

a project, then we could treat this as two independent many-to-many relationships of the
form EP:S and EP:L, where "EP" represents a combination of an employee with a
project. A record including employee, project, skill, and language would violate fourth
normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy
fourth normal form.

To summarize,

 A relation is in fourth normal form if it is in BCNF and it contains no multivalued


dependencies.


Multivalued Dependency: A type of functional dependency where the determinant


can determine more than one value.

More formally, there are 3 criteria:

1. There must be at least 3 attributes in the relation. call them A, B, and C,


for example.
2. Given A, one can determine multiple values of B.
Given A, one can determine multiple values of C.
3. B and C are independent of one another.

Book example:
Student has one or more majors.
Student participates in one or more activities.

StudentID Major Activities


100 CIS Baseball
100 CIS Volleyball
100 Accounting Baseball
100 Accounting Volleyball
200 Marketing Swimming

StudentID ->-> Major


StudentID ->-> Activities

Portfolio ID   Stock Fund            Bond Fund
999            Janus Fund            Municipal Bonds
999            Janus Fund            Dreyfus Short-Intermediate Municipal Bond Fund
999            Scudder Global Fund   Municipal Bonds
999            Scudder Global Fund   Dreyfus Short-Intermediate Municipal Bond Fund
888            Kaufmann Fund         T. Rowe Price Emerging Markets Bond Fund

A few characteristics:


1. No regular functional dependencies


2. All three attributes taken together form the key.
3. Latter two attributes are independent of one another.
4. Insertion anomaly: Cannot add a stock fund without adding a bond fund
(NULL Value). Must always maintain the combinations to preserve the
meaning.

Stock Fund and Bond Fund form a multivalued dependency on Portfolio ID.

PortfolioID ->-> Stock Fund


PortfolioID ->-> Bond Fund

Resolution: Split into two tables with the common key:

Portfolio ID Stock Fund


999 Janus Fund
999 Scudder Global Fund
888 Kaufmann Fund

Portfolio ID Bond Fund


999 Municipal Bonds
999 Dreyfus Short-Intermediate Municipal Bond Fund
888 T. Rowe Price Emerging Markets Bond Fund
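
The resolution can be declared directly as two tables sharing the common key. The table names below are assumptions:

CREATE TABLE portfolio_stock_fund (
    portfolio_id INTEGER     NOT NULL,
    stock_fund   VARCHAR(60) NOT NULL,
    PRIMARY KEY (portfolio_id, stock_fund)
);

CREATE TABLE portfolio_bond_fund (
    portfolio_id INTEGER     NOT NULL,
    bond_fund    VARCHAR(60) NOT NULL,
    PRIMARY KEY (portfolio_id, bond_fund)
);

A new stock fund can now be inserted for a portfolio without inventing a matching bond fund row, and no cross-product of combinations has to be maintained.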

c. Join Dependencies and Fifth Normal Form

In some cases there may be no lossless join decomposition into two relation schemas, but there may be a lossless join decomposition into more than two relation schemas. These cases are handled by the join dependency and fifth normal form, and it's important to note that these cases occur very rarely and are difficult to detect in practice.

Practical Rule: "Isolate Semantically Related Multiple Relationships," i.e., there may be
practical constraints on information that justify separating logically related many-to-
many relationships.

Formal Definition: A relation R is in fifth normal form (5NF)—also called projection-


join normal form (PJNF)—if and only if every join dependency in R is a consequence of
the candidate keys of R.

A join dependency (JD) specified on a relation schema R specifies a constraint on instances of R. The constraint states that every legal instance of R should have a lossless join decomposition into sub-relations of R that, when reunited, make up the entire relation R. A relation schema R is in fifth normal form (5NF) (or project-join normal form (PJNF))


with respect to a set F of functional, multivalued, and join dependencies if, for every
nontrivial join dependency JD(R1, R2, …, Rn) in F (implied by F), every Ri is a superkey
of R.

Fifth normal form deals with cases where information can be reconstructed from smaller
pieces of information that can be maintained with less redundancy. Second, third, and
fourth normal forms also serve this purpose, but fifth normal form generalizes to cases
not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the
central concept with a commonly used example, namely one involving agents,
companies, and products. If agents represent companies, companies make products, and
agents sell products, then we might want to keep a record of which agent sells which
product for which company. This information could be kept in one record type with three
fields:

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------

This form is necessary in the general case. For example, although agent Smith sells cars
made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we
need the combination of three fields to know which combinations are valid and which are
not. But suppose that a certain rule was in effect: if an agent sells a certain product, and
he represents a company making that product, then he sells that product for that company.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |

| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------

In this case, it turns out that we can reconstruct all the true facts from a normalized form
consisting of three separate record types, each containing two fields:


------------------- --------------------- -------------------


| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- | GM | truck | -------------------
---------------------

These three record types are in fifth normal form, whereas the corresponding three-field
record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its
information content cannot be reconstructed from several smaller record types, i.e., from
record types each having fewer fields than the original record. The case where all the
smaller records have the same key is excluded. If a record type can only be decomposed
into smaller records which all have the same key, then the record type is considered to be
in fifth normal form without decomposition. A record type in fifth normal form is also in
fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a
symmetric constraint such as the rule about agents, companies, and products. In the
absence of such a constraint, a record type in fourth normal form is always in fifth normal
form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the
normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized
form it may be repeated many times.

It should be observed that although the normalized form involves more record types,
there may be fewer total record occurrences. This is not apparent when there are only a
few facts to record, as in the example shown above. The advantage is realized as more
facts are recorded, since the size of the normalized files increases in an additive fashion,
while the size of the unnormalized file increases in a multiplicative fashion. For example,
if we add a new agent who sells x products for y companies, where each of these

ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
companies makes each of these products, we have to add x+y new records to the
normalized form, but xy new records to the unnormalized form.

It should be noted that all three record types are required in the normalized form in order
to reconstruct the same information. From the first two record types shown above we
learn that Jones represents Ford and that Ford makes trucks. But we can't determine
whether Jones sells Ford trucks until we look at the third record type to determine
whether Jones sells trucks at all.
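As a sketch of this reconstruction in SQL (the table names AGENT_COMPANY, COMPANY_PRODUCT, and AGENT_PRODUCT are assumed here for illustration), joining all three two-field tables recovers exactly the original three-field facts; any two of them alone would imply untrue facts:

-- Hypothetical 5NF decomposition of the AGENT/COMPANY/PRODUCT record type
CREATE TABLE AGENT_COMPANY   (AGENT VARCHAR(20), COMPANY VARCHAR(20));
CREATE TABLE COMPANY_PRODUCT (COMPANY VARCHAR(20), PRODUCT VARCHAR(20));
CREATE TABLE AGENT_PRODUCT   (AGENT VARCHAR(20), PRODUCT VARCHAR(20));

-- Reconstructing the original facts requires all three projections.
SELECT ac.AGENT, ac.COMPANY, cp.PRODUCT
FROM   AGENT_COMPANY ac, COMPANY_PRODUCT cp, AGENT_PRODUCT ap
WHERE  cp.COMPANY = ac.COMPANY
AND    ap.AGENT   = ac.AGENT
AND    ap.PRODUCT = cp.PRODUCT;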


The following example illustrates a case in which the rule about agents, companies, and
products is satisfied, and which clearly requires all three record types in the normalized
form. Any two of the record types taken alone will imply something untrue.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Toyota  | car     |
| Brown | Toyota  | bus     |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------

Observe that:

 Jones sells cars and GM makes cars, but Jones does not represent GM.
 Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
 Brown represents Ford and Brown sells buses, but Ford does not make buses.
ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
Fourth and fifth normal forms both deal with combinations of multivalued facts. One
difference is that the facts dealt with under fifth normal form are not independent, in the
sense discussed earlier. Another difference is that, although fourth normal form can deal
with more than two multivalued facts, it only recognizes them in pairwise groups. We can
best explain this in terms of the normalization process implied by fourth normal form.

If a record violates fourth normal form, the associated normalization process decomposes
it into two records, each containing fewer fields than the original record. Any of the
resulting records that still violates fourth normal form is decomposed again, and so on
until the resulting records are all in fourth normal form. At each stage, the set of records after
decomposition contains exactly the same information as the set of records before
decomposition.

In the present example, no pairwise decomposition is possible. There is no combination


of two smaller records which contains the same total information as the original record.
All three of the smaller records are needed. Hence an information-preserving pairwise
decomposition is not possible, and the original record is not in violation of fourth normal
form. Fifth normal form is needed in order to deal with the redundancies in this case.

To summarize,

 There are certain conditions under which, after decomposing a relation, it cannot
be reassembled back into its original form by rejoining the pieces; fifth normal form
addresses such cases.
 We do not consider these issues further here.

2.2.2 Inclusion Dependencies, other Dependencies and Normal Forms

Domain Key Normal Form (DKNF)

We can also always define stricter forms that take into account additional types of
dependencies and constraints. The idea behind domain-key normal form is to specify,
(theoretically, at least) the "ultimate normal form" that takes into account all possible
dependencies and constraints. A relation is said to be in DKNF if all constraints and
dependencies that should hold on the relation can be enforced simply by enforcing the
domain constraints and the key constraints specified on the relation.

For a relation in DKNF, it becomes very straightforward to enforce the constraints by
simply checking that each attribute value in a tuple is of the appropriate domain and that
every key constraint on the relation is enforced. However, it seems unlikely that complex
constraints can be included in a DKNF relation; hence, its practical utility is limited.

 A relation is in DK/NF if every constraint on the relation is a logical consequence
of the definition of keys and domains.
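To make the idea concrete, here is a minimal sketch (table and column names reuse the agent/company example; data types are assumptions) of a table whose only declared constraints are domain constraints and a key constraint, which is the spirit of DK/NF:

-- Every constraint below is either a domain constraint or a key constraint.
CREATE TABLE AGENT_COMPANY (
    AGENT   VARCHAR(20) NOT NULL,                         -- domain: non-null string
    COMPANY VARCHAR(20) NOT NULL
            CHECK (COMPANY IN ('Ford', 'GM', 'Toyota')),  -- domain constraint
    PRIMARY KEY (AGENT, COMPANY)                          -- key constraint
);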

Constraint: A rule governing static values of an attribute, such that we can determine whether
this constraint is True or False. Examples:

1. Functional Dependencies
2. Multivalued Dependencies
3. Inter-relation rules
4. Intra-relation rules

Note, however, that this does not include time-dependent constraints.

 Key: Unique identifier of a tuple.


 Domain: The physical (data type, size, NULL values) and semantic (logical)
description of what values an attribute can hold.
 There is no known algorithm for converting a relation directly into DK/NF.

Unavoidable Redundancies

Normalization certainly doesn't remove all redundancies. Certain redundancies seem to
be unavoidable, particularly when several multivalued facts are dependent rather than
independent. In the example shown in the independence topic, it seems unavoidable that
we record the fact that "Smith can type" several times. Also, when the rule about agents,
companies, and products is not in effect, it seems unavoidable that we record the fact that
"Smith sells cars" several times.

a. Inter-Record Redundancy

The normal forms discussed here deal only with redundancies occurring within a single
record type. Fifth normal form is considered to be the "ultimate" normal form with
respect to such redundancies.

Other redundancies can occur across multiple record types. For the example concerning
employees, departments, and locations, the following records are in third normal form in
spite of the obvious redundancy:

------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------

In fact, two copies of the same record type would constitute the ultimate in this kind of
undetected redundancy.

Inter-record redundancy has been recognized for some time, and has recently been
addressed in terms of normal forms and normalization.

While we have tried to present the normal forms in a simple and understandable way, we
are by no means suggesting that the data design process is correspondingly simple. The
design process involves many complexities which are quite beyond the scope of this
paper. In the first place, an initial set of data elements and records has to be developed, as
candidates for normalization.

Then the factors affecting normalization have to be assessed:

 Single-valued vs. multi-valued facts.


 Dependency on the entire key.


 Independent vs. dependent facts.


 The presence of mutual constraints.
 The presence of non-unique or non-singular representations.

And, finally, the desirability of normalization has to be assessed in terms of its
performance impact on retrieval applications.

Practical Database Design and Tuning: The Role of Information Systems in Organisations

The role played by Information Systems in today's Information Technology
explosion is very prominent, and they have become a necessity. An Information
System enables an organization to function effectively and efficiently, providing
all levels of the organization with the information needed for current and future
needs and growth.

Information Systems also help management organize their schedules and plan for
the development and growth of the organization. Modern technology goes a step
further and delivers information to the mobile phones of top executives, so that
the needed information is available on their palmtops.

The Database Design Process

Many of you asked for a "complete" example that would run through all of the normal
forms from beginning to end using the same tables. This is tough to do, but here is an
attempt:

Example relation:
EMPLOYEE (Name, Project, Task, Office, Floor, Phone)

Note: Keys are underlined.

Example Data:

Name Project Task Office Floor Phone
Bill 100X T1 400 4 1400
Bill 100X T2 400 4 1400
Bill 200Y T1 400 4 1400
Bill 200Y T2 400 4 1400
Sue 100X T33 442 4 1442
Sue 200Y T33 442 4 1442


Sue 300Z T33 442 4 1442


Ed 100X T2 588 5 1588

 Name is the employee's name


 Project is the project they are working on. Bill is working on two different
projects, Sue is working on 3.
 Task is the current task being worked on. Bill is now working on Tasks T1 and
T2. Note that Tasks are independent of the project. Examples of a task might be
faxing a memo or holding a meeting.
 Office is the office number for the employee. Bill works in office number 400.
 Floor is the floor on which the office is located.
 Phone is the phone extension. Note this is associated with the phone in the given
office.

Physical Database Design in Relational Databases

a. First Normal Form

 Assume the key is Name, Project, Task.


 Is EMPLOYEE in 1NF ?

b. Second Normal Form

 List all of the functional dependencies for EMPLOYEE.


 Are all of the non-key attributes dependent on the entire key?
 Split into two relations EMPLOYEE_PROJECT_TASK and
EMPLOYEE_OFFICE_PHONE. EMPLOYEE_PROJECT_TASK (Name,
Project, Task)

Name Project Task


Bill 100X T1
Bill 100X T2

Bill 200Y T1
Bill 200Y T2
Sue 100X T33
Sue 200Y T33
Sue 300Z T33
Ed 100X T2

 EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone)


Name Office Floor Phone


Bill 400 4 1400
Sue 442 4 1442
Ed 588 5 1588
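As a sketch in SQL (table and column names as in the walkthrough above; data types are assumptions), the 2NF split can be declared as:

-- Attributes that depend on the whole key stay with (Name, Project, Task);
-- attributes that depend on Name alone move to their own relation.
CREATE TABLE EMPLOYEE_PROJECT_TASK (
    Name    VARCHAR(20),
    Project VARCHAR(10),
    Task    VARCHAR(10),
    PRIMARY KEY (Name, Project, Task)
);

CREATE TABLE EMPLOYEE_OFFICE_PHONE (
    Name   VARCHAR(20) PRIMARY KEY,
    Office NUMBER(4),
    Floor  NUMBER(2),
    Phone  NUMBER(4)
);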

c. Third Normal Form

 Assume each office has exactly one phone number.


 Are there any transitive dependencies ?
 Where are the modification anomalies in EMPLOYEE_OFFICE_PHONE ?
 Split EMPLOYEE_OFFICE_PHONE.

EMPLOYEE_PROJECT_TASK (Name, Project, Task)

Name Project Task


Bill 100X T1
Bill 100X T2
Bill 200Y T1
Bill 200Y T2
Sue 100X T33
Sue 200Y T33
Sue 300Z T33
Ed 100X T2

EMPLOYEE_OFFICE (Name, Office, Floor)

Name Office Floor

Bill 400 4
Sue 442 4
Ed 588 5

EMPLOYEE_PHONE (Office, Phone)

Office Phone
400 1400


442 1442
588 1588
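As a quick check of the split (names as in the example above), an employee's phone can still be recovered by joining the two relations on Office, so no information is lost by removing the transitive dependency Name -> Office -> Phone:

-- Reconstruct Name -> Phone from the two 3NF relations.
SELECT eo.Name, eo.Office, ep.Phone
FROM   EMPLOYEE_OFFICE eo, EMPLOYEE_PHONE ep
WHERE  ep.Office = eo.Office;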

d. Boyce-Codd Normal Form

 List all of the functional dependencies for EMPLOYEE_PROJECT_TASK,
EMPLOYEE_OFFICE, and EMPLOYEE_PHONE. Look at the determinants.
 Are all determinants candidate keys ?

e. Fourth Normal Form

 Are there any multivalued dependencies ?


 What are the modification anomalies ?
 Split EMPLOYEE_PROJECT_TASK.

EMPLOYEE_PROJECT (Name, Project )

Name Project
Bill 100X
Bill 200Y
Sue 100X
Sue 200Y
Sue 300Z
Ed 100X

EMPLOYEE_TASK (Name, Task )

Name Task
Bill T1
Bill T2
Sue T33
Ed T2

EMPLOYEE_OFFICE (Name, Office, Floor)

Name Office Floor


Bill 400 4


Sue 442 4
Ed 588 5

EMPLOYEE_PHONE (Office, Phone)

Office Phone
400 1400
442 1442
588 1588
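A sketch of why the 4NF split is lossless (names as above): because Project and Task are independent multivalued facts about Name, the natural join of the two projections reproduces every (Name, Project, Task) combination of the original relation.

-- Rejoining the two 4NF projections reconstructs EMPLOYEE_PROJECT_TASK.
SELECT p.Name, p.Project, t.Task
FROM   EMPLOYEE_PROJECT p, EMPLOYEE_TASK t
WHERE  t.Name = p.Name;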

At each step of the process, we did the following:

1. Write out the relation


2. (optionally) Write out some example data.
3. Write out all of the functional dependencies
4. Starting with 1NF, go through each normal form and state why the relation is in
the given normal form.

Another short example

Consider the following example of normalization for a CUSTOMER relation.

Relation Name
CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)

Example Data

CustomerID Name Street City State Zip Phone


C101 Bill Smith 123 First St. New Brunswick NJ 07101 732-555-1212
C102 Mary Green 11 Birch St. Old Bridge NJ 07066 908-555-1212

f. Functional Dependencies

CustomerID -> Name, Street, City, State, Zip, Phone


Zip -> City, State

g. Normalization

 1NF Meets the definition of a relation.


 2NF All non key attributes are dependent on all of the key.
 3NF There are no transitive dependencies.


 BCNF Relation CUSTOMER is not in BCNF because one of the determinants,
Zip, cannot act as a key for the entire relation. Solution: Split CUSTOMER into
two relations:
CUSTOMER (CustomerID, Name, Street, Zip, Phone)
ZIPCODES (Zip, City, State)

Check both CUSTOMER and ZIPCODES to ensure they are both in 1NF up to
BCNF.

 4NF There are no multi-valued dependencies in either CUSTOMER or


ZIPCODES.

As a final step, consider de-normalization.
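A minimal DDL sketch of the BCNF split above (data types are assumptions):

-- The determinant Zip becomes the key of its own relation.
CREATE TABLE ZIPCODES (
    Zip   CHAR(5) PRIMARY KEY,
    City  VARCHAR(30),
    State CHAR(2)
);

CREATE TABLE CUSTOMER (
    CustomerID CHAR(4) PRIMARY KEY,
    Name       VARCHAR(30),
    Street     VARCHAR(30),
    Zip        CHAR(5) REFERENCES ZIPCODES(Zip),
    Phone      VARCHAR(12)
);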

2.2.3 An Overview of Database Tuning in Relational Databases & Automated Design Tools

Databases need to be tuned and pruned in order to provide updated and accurate
information, so it is vital to manage the database well.
If you want to get the maximum performance from your applications, you need to tune
your SQL statements. Tuning a SQL statement means discovering the execution plan
that Oracle is using; once the execution plan is known, one can attempt to improve it.

Query performance can be improved in many ways: by creating indexes, by increasing
the size of the buffer cache, and by using optimizer hints. Hints are instructions to the
Oracle optimizer that are embedded within your statement, and they can be used to control
virtually any aspect of statement execution.
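For example, a sketch of a hint that asks the optimizer to use a particular index (the table alias and index name here are hypothetical):

-- Without the hint the optimizer might choose a full table scan.
SELECT /*+ INDEX(m mail_server_type_idx) */ *
FROM   MAIL_SERVER m
WHERE  m.type = 0;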

Two important points when tuning rollback segments are detecting contention and
reducing shrinkage.
Contention occurs when there are too few rollback segments in your database for the
amount of updates that are occurring. Shrinkage occurs when one defines an optimal
size for a rollback segment and the rollback segment then grows beyond that size and is
forced to shrink back again.

Normalization is carried out in practice so that the resulting designs are of high quality
and meet the desirable properties.

The practical utility of these normal forms becomes questionable when the constraints on
which they are based are hard to understand or to detect

The database designers need not normalize to the highest possible normal form. ( usually
up to 3NF, BCNF or 4NF)

Denormalization: the process of storing the join of higher normal forms relations as a
base relation – which is in a lower normal form


2.3. Revision Points


 Normalisation may be defined as a step by step reversible process of transforming
an unnormalised relation into relations with progressively simpler structures.
 Denormalization: the process of storing the join of higher normal forms relations
as a base relation – which is in a lower normal form
 Constraint: A rule governing static values of an attribute, such that we can
determine if this constraint is True or False.
 There are three types of anomalies involved (1)update anomaly, (2)insertion
anomaly and (3)deletion anomaly.
 A relation is in DK/NF if every constraint on the relation is a logical consequence
of the definition of keys and domains.

2.4. Intext Questions


1. What do you mean by normalization? Explain
2. Explain why we go for normalization with example.
3. What are the three types of anomalies?
4. Explain the first normal form with an example.
5. Define second normal form.
6. What do you mean by functional dependencies?
7. When is a relation in third normal form? Explain with an example.
8. Define Boyce/Codd normal form. Are Boyce/Codd normal form and third
normal form equal? Explain.
9. How do we produce a good design? What relations should we have in the
database? What attributes should these relations have?
10. Define fourth normal form with an example.
11. Explain multivalued dependencies.

2.5. Summary

 Normalization of data can be looked on as a process during which unsatisfactory


relation schemas are decomposed by breaking up their attributes into smaller
relation schemas that possess desirable properties
 A relation is in first normal form (1NF) if and only if all underlying simple
domains contain atomic values only.
 A relation is in second normal form (2NF) if and only if it is in 1NF and every
non key attribute is fully dependent on the primary key.
 A relation is in third normal form (3NF) if and only if it is in 2NF and every non
key attribute is non transitively dependent on the primary key.
 A relation is in Boyce/Codd normal form (BCNF) if and only if every determinant
is a candidate key. [A determinant is any attribute on which some other attribute
is (fully) functionally dependent.]
 A relation R is in fourth normal form (4NF) if and only if, whenever there exists a
multivalued dependency in the R, say A->>B, then all attributes of R are also
functionally dependent on A.


 Multivalued dependencies are defined essentially as relationships which accept


the "cross-product" maintenance policy
 We should also point out that multivalued dependencies and fourth normal form
apply as well to relationships involving more than two fields
 A relation R is in fifth normal form (5NF)—also called projection-join normal
form (PJNF)—if and only if every join dependency in R is a consequence of the
candidate keys of R.

2.6 Terminal Exercise


1. ________may be defined as a step by step reversible process of transforming an
unnormalised relation into relations with progressively simpler structures.
2. A relation R is in ___________ normal form if and only if, whenever there
exists multivalued dependency in the R, say A->>B, then all attributes of R are
also functionally dependent on A.
3. Define Multivalued Dependency.
4. What is a Constraint?

2.7 Supplementary material


Raghu Ramakrishnan, Johannes Gehrke, "Database Management Systems", McGraw-Hill.

2.8 Assignment
Prepare an assignment about Oracle 8i.

2.9 Reference Books


 Oracle 8i DBA Handbook by Kevin Loney and Marlene Theriault
 www. Dbatoolbox.com
 Oracle DBA101: A Beginners Guide by Rao R. Uppaluri

2.10 Learning Activity


1. Learn and apply the concepts of Normalisation in the SQL Queries.

2.11 Keywords
1. Normalization
2. Boyce-Codd Normal form
3. 1NF – First Normal Form
4. 2NF – Second Normal Form
5. Multi-Valued Dependency
6. Functional Dependency


UNIT - III

Topics:
 Database System Architecture and The System Catalog
 System Catalog Information
 Data Dictionary and Data Repository Systems
 Query Processing and Optimization: Translating SQL Queries
into Relational Algebra
 Basic Algorithms for Executing Query Operations
 Using Heuristics In Query Optimization
 Query Optimization in Oracle
 Transaction Processing Concepts

3.0. Introduction
The database system architecture and the system catalog form a base for understanding
the basic structure and functions of a database management system. The many varieties of
database management software, along with Object Linking and Embedding (OLE) objects
and broker architectures, need to be understood in order to implement databases effectively.

3.1. Objective
The objective of this unit is to understand the Database System Architecture, Information
Accesses By DBMS Software Modules such as Data Dictionary and Data Repository
Systems. Query Processing and Optimization is an area of interest to understand the
Basic Algorithms for Executing Query Operations – Using Heuristics In Query
Optimization. Transaction Processing Concepts are explained in terms of Transaction
and System Concepts which includes Schedules and Recoverability.

3.2 Content

3.2.1 Database System Architectures and the System catalog: System Architectures
for DBMS
Data Model: A set of concepts to describe the structure of a database, and certain
constraints that the database should obey.

Data Model Operations: Operations for specifying database retrievals and updates by
referring to the concepts of the data model. Operations on the data model may include
basic operations and user-defined operations.


Variations of Distributed Environments:


• Homogeneous DDBMS
• Heterogeneous DDBMS
• Federated or Multidatabase Systems

Catalogs for Relational DBMS

a. Integrated data.
Integrated data means that the database may be thought of as a unification of several
otherwise distinct data files, with any redundancy among those files either wholly or
partly eliminated.

Consequences of integration are sharing and the idea that any given user will normally be
concerned with only a subset of the total database; moreover, different users' subsets will
overlap in many different ways, i.e., a given database will be perceived by different users
in different ways. Also, users may be restricted to certain subsets of data.

b. Definition of Entity.
An entity is any distinguishable real world object that is to be represented in the database;
each entity will have attributes or properties, e.g. the entity lecture has the properties
place and time. A set of similar entities is known as an entity type.

c. Network model Overview


A network data structure can be regarded as an extended form of the hierarchic data
structure - the principal distinction between the two being that in a hierarchic structure, a
child record has exactly one parent whereas in a network structure, a child record can
have any number of parents (possibly even zero).

A network database consists of two data sets, a set of records and a set of links, where the
record types are made up of fields in the usual way.
Networks are complicated data structures. Operators on network databases are complex,
functioning on individual records, and not sets of records. Increased complexity does not
mean increased functionality and the network model is no more powerful than the
relational model. However, a network-based DBMS can provide good performance
because its lack of abstraction means it is closer to the storage structures used, though
this is at the expense of easy user programming. The network model also incorporates
certain integrity rules.

d. System Tables
Information about the database is maintained in the system catalogs. These vary from
system to system because the contents of the system catalog are specific to a particular
system. The INFORMIX system contains the following tables in its system catalog.

 systables - describes database tables


 syscolumns - describes columns in tables


 sysindexes - describes indexes in columns


 systabauth - identifies table-level privileges
 syscolauth - identifies column-level privileges
 sysdepend - describes how views depend on tables
 syssynonyms - lists synonyms for tables
 sysusers - identifies database-level privileges
 sysviews - defines views

3.2.2. System Catalog information in Oracle

System Catalogs

Every DBMS requires information by which it can estimate the cost of the various possible
plans that may be used to execute a query, so as to choose the best plan. For this it
maintains statistics, including histograms, in its catalogs. The histograms used by Postgres are a
combination of equi-depth and end-biased histograms, which leads to accurate prediction
of both frequently occurring values and range distributions of data values.

The system catalogs are the place where a relational database management system stores
schema metadata, such as information about tables and columns, and internal
bookkeeping information. PostgreSQL's system catalogs are regular tables.
You can drop and recreate the tables, add columns, insert and update values, and severely
mess up your system that way

 In ORACLE, the collection of metadata is called the data dictionary.

 The metadata is information about schema objects, such as tables, indexes,


views, triggers, and more.

 Access to the data dictionary is allowed through numerous views, which are
divided into three categories: USER, ALL, and DBA.

o USER views contain schema information for objects owned by a given
user.
o ALL views contain schema information for objects owned by a given user
plus objects that the user has been granted access to.
o DBA views are for the DBA and contain information about all database
objects.

 The system catalog contains information about all three levels of database
schemas: external (view definitions), conceptual (base tables), and internal
(storage and index descriptions).

 Sample queries for the different levels of schemas:


o The Conceptual Schema -

SELECT * FROM ALL_CATALOG WHERE OWNER = ‘SMITH’;

 The Internal Schema -

SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS, BLOCKS,


EMPTY_BLOCKS,AVG_ROW_LENGTH FROM USER_TABLES
WHERE TABLE_NAME = ‘ORDERS’;

 The information from USER_TABLES also plays a useful


role in query processing and optimization

SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS,


DISTINCT_KEYS,AVG_LEAF_BLOCKS_PER_KEY,AVG_DATA_BLOCKS_PER_
KEY FROM USER_INDEXES WHERE TABLE_NAME = ‘ORDERS’;

 The storage information about the indexes is just as important to the


query optimizer as the storage information about the relations. This
information is used by the optimizer in deciding how to execute a query
efficiently .

o The External Schema -

SELECT * FROM USER_VIEWS

Other Catalog Information accesses by DBMS Software Modules

SQL objects (i.e., tables, views, ...) are contained in schemas.
Schemas are contained in catalogs. Each schema has a single owner.
Objects can be referenced with explicit or implicit catalog and schema name:
FROM people --unqualified name
FROM sample.people --partially qualified name
FROM cat1.sample.people --fully qualified name


a. Predefined types

 Metadata description: The information stored in a catalog of an RDBMS


includes the schema names, relation names, attribute names, and attribute
domains (data types), as well as descriptions of constraints (primary keys,
secondary keys, foreign keys, NULL/NOT NULL, and other types of constraints),
views, and storage structures and indexes.

 Security and authorization information is also kept in the catalog; this describes
each user’s privileges to access specific database relations and views, and the
creator or owner of each relation.

 It is common practice to store the catalog itself as relations.

 Most relational systems store their catalog files as DBMS relations. However,
because the catalog is accessed very frequently by the DBMS modules, it is
important to implement catalog access as efficiently as possible.

 It may be more efficient to use a specialized set of data structures and access
routines to implement the catalog, thus trading generality for efficiency.

 System initialization problem: The catalog tables must be created before the
system can function!

3.2.3. Data Dictionary and Data Repository Systems

The data dictionary is the repository for database metadata, which is a fancy term for data
describing the database. When you create a table, your description of that table is
considered metadata, and Oracle stores that metadata in its data dictionary. Similarly,
Oracle stores the definitions for other objects you create, such as views, PL/SQL
packages, triggers, synonyms, indexes, and so forth. The database software uses this
metadata to interpret and execute SQL statements, and to properly manage stored data.


You can use the metadata as your window into the database. Whether you're a DBA or a
developer, you need a way to learn about the objects and data within your database.

Codd's fourth rule for relational database systems states that database metadata must be
stored in relational tables just like any other type of data. Oracle exposes database
metadata through a large collection of data dictionary views. Does this violate Codd's
rule? By no means! Oracle's data dictionary views are all based on tables, but the views
provide a much more user-friendly presentation of the metadata.

For example, to find out the names of all of the relational tables that you own, you can
issue the following query:

SELECT table_name
FROM user_tables;

Note the prefix user_ in this example.

Oracle divides data dictionary views into the three families, as indicated by the following
prefixes:

 USER_

USER views return information about objects owned by the currently-logged-on


database user. For example, a query to USER_TABLES returns a list of all of the
relational tables that you own.

 ALL_

ALL views return information about all objects to which you have access,
regardless of who owns them. For example, a query to ALL_TABLES returns a
list not only of all of the relational tables that you own, but also of all relational
tables to which their owners have specifically granted you access (using the
GRANT command).

 DBA_

DBA views are generally accessible only to database administrators, and return
information about all objects in the database, regardless of ownership or access
privileges. For example, a query to DBA_TABLES will return a list of all
relational tables in the database, whether or not you own them or have been
granted access to them. Occasionally, database administrators will grant
developers access to DBA views. Usually, unless you yourself are a DBA, you
won't have access to the DBA views.
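For instance, two sketch queries against these view families (ALL_TABLES and DBA_TABLES are standard data dictionary views; the DBA_ query requires DBA privileges):

-- Tables you can access, counted per owner.
SELECT owner, COUNT(*) AS table_count
FROM   all_tables
GROUP BY owner;

-- Every relational table in the database, regardless of ownership.
SELECT owner, table_name
FROM   dba_tables
ORDER BY owner, table_name;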

Many views have analogs in all three groups. For example, you have USER_TABLES,
ALL_TABLES, and DBA_TABLES. A table is a schema object, and thus owned by a
user, hence the need for USER_TABLES. Table owners can grant specific users access to
their tables, hence the need for ALL_TABLES. Database administrators need to be aware
of all tables in the database, hence the need for DBA_TABLES. In some cases, it doesn't
make sense for a view to have an analog in all groups.

There is no USER_DIRECTORIES view, for example, because directories are database


objects not owned by any one user. However, you will find an ALL_DIRECTORIES
view to show you the directories to which you have access, and you will find a
DBA_DIRECTORIES view to show the database administrator a list of all directories
defined in the database.

Oracle's data dictionary views are mapped onto underlying base tables, but the views
form the primary interface to Oracle's metadata. Unless you have specific reasons to go
around the views directly to the underlying base tables, you should use the views. The
views return data in a much more understandable format than you'll get from querying
the underlying tables. In addition, the views make up the interface that Oracle documents
and supports. Using an undocumented interface, i.e. the base tables, is a risky practice.

The primary source of information on Oracle's many data dictionary views is the Oracle9i
Database Reference manual.

You can access that manual, and many others, from the Oracle Technology Network
(OTN). You have to register with OTN in order to view Oracle's documentation online,
but registration is free. If you prefer a hardcopy reference, Oracle In A Nutshell,
published by O'Reilly & Associates, is another source of Oracle data dictionary
information.

Query Processing and Optimization: Translating SQL Queries into Relational Algebra

All examples discussed below refer to the COMPANY database schema.

a. Relational Algebra

The basic set of operations for the relational model is known as the relational algebra.
These operations enable a user to specify basic retrieval requests.

The result of a retrieval is a new relation, which may have been formed from one or more
relations. The algebra operations thus produce new relations, which can be further
manipulated using operations of the same algebra.

A sequence of relational algebra operations forms a relational algebra expression,
whose result will also be a relation that represents the result of a database query
(or retrieval request).

b. Unary Relational Operations

i. SELECT Operation

SELECT operation is used to select a subset of the tuples from a relation that satisfy a
selection condition. It is a filter that keeps only those tuples that satisfy a qualifying
condition – those satisfying the condition are selected while others are discarded.

Example: To select the EMPLOYEE tuples whose department number is four, or those
whose salary is greater than $30,000, the following notation is used:
σ DNO=4 (EMPLOYEE)        σ SALARY>30000 (EMPLOYEE)

In general, the select operation is denoted by σ<selection condition>(R), where the
symbol σ (sigma) is used to denote the select operator, and the selection condition is a
Boolean expression specified on the attributes of relation R.
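In SQL, the same selections are written with a WHERE clause (COMPANY table and column names as in the examples):

-- Relational algebra SELECT corresponds to the SQL WHERE clause.
SELECT * FROM EMPLOYEE WHERE DNO = 4;

SELECT * FROM EMPLOYEE WHERE SALARY > 30000;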

ii. SELECT Operation Properties

The SELECT operation σ<selection condition>(R) produces a relation S
that has the same schema as R.

The SELECT operation is commutative; i.e.,
σ<condition1>(σ<condition2>(R)) = σ<condition2>(σ<condition1>(R))

A cascaded SELECT operation may be applied in any order; i.e.,
σ<condition1>(σ<condition2>(σ<condition3>(R))) = σ<condition2>(σ<condition3>(σ<condition1>(R)))

A cascaded SELECT operation may be replaced by a single selection with a conjunction
of all the conditions; i.e.,
σ<condition1>(σ<condition2>(σ<condition3>(R))) = σ<condition1> AND <condition2> AND <condition3>(R)

Results of SELECT and PROJECT operations


iii. PROJECT Operation

This operation selects certain columns from the table and discards the other columns. The
PROJECT creates a vertical partitioning – one with the needed columns (attributes)
containing results of the operation and other containing the discarded Columns.
Example: To list each employee's first and last name and salary, the following is used:

π LNAME, FNAME, SALARY (EMPLOYEE)

The general form of the project operation is π<attribute list>(R), where π (pi) is the
symbol used to represent the project operation and <attribute list> is the desired list of
attributes from the attributes of relation R.

The project operation removes any duplicate tuples, so the result of the project operation
is a set of tuples and hence a valid relation.

The number of tuples in the result of a projection π<list>(R) is always less than or equal
to the number of tuples in R.

If the list of attributes includes a key of R, then the number of tuples is equal to the
number of tuples in R.

π<list1>(π<list2>(R)) = π<list1>(R) as long as <list2> contains the attributes in <list1>.
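In SQL, a projection with duplicate elimination is written with SELECT DISTINCT:

-- DISTINCT mirrors the duplicate elimination performed by PROJECT.
SELECT DISTINCT LNAME, FNAME, SALARY
FROM   EMPLOYEE;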

iv. Rename Operation

We may want to apply several relational algebra operations one after the other.


Either we can write the operations as a single relational algebra expression by nesting
the operations, or we can apply one operation at a time and create
intermediate result relations. In the latter case, we must give names to the relations that
hold the intermediate results.

Example: To retrieve the first name, last name, and salary of all employees
who work in department number 5, we must apply a select and a project operation.
We can write a single relational algebra expression as follows:

π FNAME, LNAME, SALARY (σ DNO=5 (EMPLOYEE))


OR We can explicitly show the sequence of operations, giving a name to each
intermediate relation:

DEP5_EMPS ← σ DNO=5 (EMPLOYEE)
RESULT ← π FNAME, LNAME, SALARY (DEP5_EMPS)

The rename operator is ρ (rho).

The general Rename operation can be expressed by any of the following forms:

ρ S(B1, B2, ..., Bn)(R) is a renamed relation S based on R, with column names B1, B2, ..., Bn.

ρ S(R) is a renamed relation S based on R (which does not specify column names).

ρ (B1, B2, ..., Bn)(R) is a renamed relation with column names B1, B2, ..., Bn which does
not specify a new relation name.

Relational Algebra Operations From Set Theory

a. UNION Operation
The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that are
either in R or in S or in both R and S.
Duplicate tuples are eliminated.

The union operation produces the tuples that are in either RESULT1 or RESULT2 or
both. The two operands must be “type compatible”.


b. Type Compatibility

The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) must have the same
number of attributes, and the domains of corresponding attributes must be compatible;
that is, dom(Ai)=dom(Bi) for i=1, 2, ..., n.

The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 − R2 has the same attribute names as the
first operand relation R1.

UNION Example

STUDENT ∪ INSTRUCTOR

c. Intersection Operation

The result of this operation, denoted by R ∩ S, is a relation that includes all tuples that are
in both R and S. The two operands must be "type compatible".

Example: The result of the intersection operation STUDENT ∩ INSTRUCTOR includes
only those who are both students and instructors.

d. Set Difference (or MINUS) Operation

The result of this operation, denoted by R − S, is a relation that includes all tuples that are
in R but not in S. The two operands must be "type compatible".

Example: STUDENT − INSTRUCTOR gives the names of students who are not instructors,
and INSTRUCTOR − STUDENT gives the names of instructors who are not students.

Notice that both union and intersection are commutative operations

Both union and intersection can be treated as n-ary operations applicable to any number
of relations as both are associative operations;


The minus operation is not commutative; that is, in general
R − S ≠ S − R
e. CARTESIAN (or cross product) Operation

This operation is used to combine tuples from two relations in a combinatorial fashion. In
general, the result of R(A1, A2, ..., An) x S(B1, B2, ..., Bm) is a relation Q with degree
n+m attributes Q(A1, A2, ..., An, B1, B2, ..., Bm), in that order. The resulting relation Q
has one tuple for each combination of tuples, one from R and one from S.

Hence, if R has nR tuples (denoted as |R| = nR), and S has nS tuples, then |R x S| will have
nR * nS tuples.

The two operands do NOT have to be "type compatible”

Example:
FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
EMPNAMES ← π FNAME, LNAME, SSN (FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES x DEPENDENT

Binary Relational Operations

a. JOIN Operation

The sequence of a cartesian product followed by a select is used so commonly to identify
and select related tuples from two relations that a special operation, called JOIN, was created.

This operation is very important for any relational database with more than a single
relation, because it allows us to process relationships among relations.

The general form of a join operation on two relations R(A1, A2,., An) and S(B1, B2, ..,
Bm) is:

R ⋈<join condition> S, where R and S can be any relations that result from general
relational algebra expressions.

ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
Example: Suppose that we want to retrieve the name of the manager of each department.
To get the manager’s name, we need to combine each DEPARTMENT tuple with the
EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department
tuple. We do this by using the join operation.

DEPT_MGR ← DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
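The equivalent SQL, sketched with the COMPANY tables assumed in the example:

-- JOIN of DEPARTMENT with EMPLOYEE on the manager's SSN.
SELECT d.DNAME, e.FNAME, e.LNAME
FROM   DEPARTMENT d, EMPLOYEE e
WHERE  d.MGRSSN = e.SSN;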

b. EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons only.


Such a join, where the only comparison operator used is =, is called an EQUIJOIN. In the
result of an EQUIJOIN we always have one or more pairs of attributes that have identical
values in every tuple.

The JOIN seen in the previous example was EQUIJOIN.

c. NATURAL JOIN Operation


Because one of each pair of attributes with identical values is superfluous, a New
operation called natural join—denoted by *—was created to get rid of the second
(superfluous) attribute in an EQUIJOIN condition.

The standard definition of natural join requires that the two join attributes, or each pair
of corresponding join attributes, have the same name in both relations. If this is not the
case, a renaming operation is applied first.

Example: To apply a natural join on the DNUMBER attributes of DEPARTMENT and


DEPT_LOCATIONS, it is sufficient to write:

DEPT_LOCS = DEPARTMENT * DEPT_LOCATIONS

The set of operations including select (σ), project (π), union (∪), set difference (−), and
cartesian product (×) is called a complete set because any other relational algebra expression
can be expressed by a combination of these five operations.

For example:
R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
R ⋈<join condition> S = σ<join condition> (R × S)

d. DIVISION Operation

The division operation is applied to two relations
R(Z) ÷ S(X), where X is a subset of Z. Let Y = Z − X; that is, let Y be the set of attributes of R
that are not attributes of S.

The result of DIVISION is a relation T(Y) that includes a tuple t if tuples tR appear in R
with tR[Y] = t, and with tR[X] = ts for every tuple ts in S.

For a tuple t to appear in the result T of the DIVISION, the values in t must appear in R
in combination with every tuple in S.
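SQL has no direct division operator; a common sketch uses a double NOT EXISTS. For example, assuming the usual COMPANY relations WORKS_ON(ESSN, PNO, ...) and PROJECT(PNUMBER, DNUM, ...), the employees who work on every project controlled by department 5 can be found as:

-- Employees for whom no department-5 project exists that they do not work on.
SELECT DISTINCT w.ESSN
FROM   WORKS_ON w
WHERE  NOT EXISTS (
         SELECT p.PNUMBER
         FROM   PROJECT p
         WHERE  p.DNUM = 5
         AND    NOT EXISTS (
                  SELECT 1
                  FROM   WORKS_ON w2
                  WHERE  w2.ESSN = w.ESSN
                  AND    w2.PNO  = p.PNUMBER));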

Additional Relational Operations

a. Aggregate Functions and Grouping

A type of request that cannot be expressed in the basic relational algebra is to specify
mathematical aggregate functions on collections of values from the database.


Common functions applied to collections of numeric values include SUM, AVERAGE,


MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or
values.

b. Use of the Functional operator F

FMAX Salary (Employee) retrieves the maximum salary value


from the Employee relation

FMIN Salary (Employee) retrieves the minimum Salary value from the Employee
relation

FSUM Salary ( Employee) retrieves the sum of the Salary from the Employee relation
DNO FCOUNT SSN, AVERAGE Salary ( Employee) groups employees by DNO
(department number) and computes the count of employees and average salary per
department
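In SQL, the grouped aggregation in the last example reads:

-- Count of employees and average salary per department.
SELECT DNO, COUNT(SSN) AS emp_count, AVG(SALARY) AS avg_salary
FROM   EMPLOYEE
GROUP BY DNO;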
c. Recursive Closure Operations
Another type of operation that, in general, cannot be specified in the basic original
relational algebra is recursive closure. This operation is applied to a recursive
relationship.

An example of a recursive operation is to retrieve all SUPERVISEES of an EMPLOYEE


e at all levels— that is, all EMPLOYEE e’ directly supervised by e; all employees e’’
directly supervised by each employee e’; all employees e’’’ directly supervised by each
employee e’’; and so on .

d. The OUTER JOIN Operation

In NATURAL JOIN tuples without a matching (or related) tuple are eliminated from the
join result. Tuples with null in the join attributes are also eliminated.

This amounts to loss of information.

A set of operations, called outer joins, can be used when we want to keep all the tuples in
R, or all those in S, or all those in both relations in the result of the join, regardless of
whether or not they have matching tuples in the other relation.
ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
The left outer join operation keeps every tuple in the first or left relation R in R ⟕ S; if no
matching tuple is found in S, then the attributes of S in the join result are filled or
"padded" with null values.

A similar operation, right outer join R ⟖ S, keeps every tuple in the second or right relation S
in the result.

A third operation, full outer join, denoted by R ⟗ S, keeps all tuples in both the left and the
right relations when no matching tuples are found, padding them with null values as needed.
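As a sketch, the left outer join of the earlier department-manager example in SQL-92 syntax:

-- Keep every DEPARTMENT row even when it has no matching manager;
-- the EMPLOYEE columns of unmatched rows are padded with NULLs.
SELECT d.DNAME, e.FNAME, e.LNAME
FROM   DEPARTMENT d LEFT OUTER JOIN EMPLOYEE e ON d.MGRSSN = e.SSN;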


e. OUTER UNION Operations

The outer union operation was developed to take the union of tuples from two
relations if the relations are not union compatible.

This operation will take the union of tuples in two relations R(X, Y) and S(X, Z)
that are partially compatible, meaning that only some of their attributes, say X,
are union compatible.

The attributes that are union compatible are represented only once in the result,
and those attributes that are not union compatible from either relation are also
kept in the result relation T(X, Y, Z).

3.2.4 Relational Calculus

A relational calculus expression creates a new relation, which is specified in terms of


variables that range over rows of the stored database relations ( in tuple calculus) or over
columns of the stored relations ( in domain calculus).

In a calculus expression, there is no order of operations to specify how to retrieve the


query result— a calculus expression specifies only what information the result should
contain. This is the main distinguishing feature between relational algebra and relational
calculus.

Relational calculus is considered to be a Nonprocedural, language. This differs from


relational algebra, where we must write a sequence of operations to specify a retrieval
request; hence relational algebra can be considered as a procedural way of stating a
query.

Tuple Relational Calculus

The tuple relational calculus is based on specifying a number of tuple variables. Each
tuple variable usually ranges over a particular database relation, meaning that the
variable may take as its value any individual tuple from that relation.

A simple tuple relational calculus query is of the form


ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
{ t | COND(t)} where t is a tuple variable and COND (t) is a conditional expression
involving t. The result of such a query is the set of all tuples t that satisfy COND (t).

Example: To find the first and last names of all employees whose salary is above
$ 50,000, we can write the following tuple calculus expression:
{ t.FNAME, t.LNAME | EMPLOYEE(t) AND t.SALARY>50000}
The condition EMPLOYEE(t) specifies that the range relation of tuple variable t

is EMPLOYEE. The first and last name (PROJECTION FNAME, LNAME) of each
EMPLOYEE tuple t that satisfies the condition t.SALARY>50000
(SELECTION SALARY>50000) will be retrieved.
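The corresponding SQL query is simply:

-- SQL rendering of the tuple calculus expression above.
SELECT FNAME, LNAME
FROM   EMPLOYEE
WHERE  SALARY > 50000;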

3.2.5 Executing Query Operations

Basic Algorithms

a. External Sorting
 Sorting is one of the primary algorithms used in Query processing (eg., ORDER
BY-clause requires a sorting).
 External Sorting is used for large files of records stored on disk that do not fit
entirely in main memory.
 The typical external sorting algorithm uses a sort-merge strategy. The algorithm
consists of two phases:
1. Sorting Phase
2. Merging Phase

b. Implementing the Select Operation:

A number of search algorithms are possible for selecting records from a file.

The following search methods are available:


1. Linear Search (brute force)
2. Binary Search
3. Using a Primary index (or hash function)
4. Using a Primary index to retrieve multiple records.
5. Using a Clustering index to retrieve multiple records.
6. Using a secondary(B+ -tree) index on an equality comparison.

c. Implementing the JOIN Operation

 The JOIN operation is one of the most time consuming operations in query
processing.
 Four of the most common techniques for performing a join are as following:
1. Nested-loop join (brute force)
2. Single loop join (Using an access structure to retrieve the matching

ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
records).
3. Sort-merge join.
4. Hash join.

d. Implementing PROJECT and Set operations

 Implementation of a PROJECT operation is straightforward if attribute list


includes a key of relation R.
 If the attribute list does not include a key of relation R, duplicate tuples must be
eliminated.


 Set operations (∩, U, X, ─) are sometimes expensive to implement. In particular


the Cartesian product operation is quite expensive.
 Since Union, Intersections, Set difference apply only to union-compatible
relations, their implementation can be done by using some variations of the
Sort-merge technique.

 Hashing can also be used to implement UNION, INTERSECTION and SET


DIFFERENCE.

e. Implementing Aggregate Operations

 The aggregate operations (MIN, MAX, COUNT, AVERAGE, SUM), when


applied to an entire table, can be computed by a table scan or by using an
appropriate index, for example,
SELECT MAX (SALARY)
FROM EMPLOYEE;
 If an ascending index on salary exists, for the employee relation, it can be used
(otherwise we can scan the entire table).
 The index can also be used for the COUNT, AVERAGE and SUM aggregates.
 When a GROUP BY clause is used in a query , the aggregate operator must be
applied separately to each group of tuples.

Using Heuristics in Query optimization

a. The Existential and Universal Quantifiers

Two special symbols called quantifiers can appear in formulas; these are the
universal quantifier (∀) and the existential quantifier (∃).

Informally, a tuple variable t is bound if it is quantified, meaning that it appears in an (∃ t)
or (∀ t) clause; otherwise, it is free.

If F is a formula, then so is (∃ t)(F), where t is a tuple variable. The formula (∃ t)(F) is true if
the formula F evaluates to true for some tuple assigned to free occurrences of t in F;
otherwise (∃ t)(F) is false.

If F is a formula, then so is (∀ t)(F), where t is a tuple variable. The formula (∀ t)(F) is true if
the formula F evaluates to true for every tuple assigned to free occurrences of t in F;
otherwise (∀ t)(F) is false.

It is called the universal or “for all” quantifier because every tuple in “ the universe of”
tuples must make F true to make the quantified formula true.

b. Example Query Using Existential Quantifier

Retrieve the name and address of all employees who work for the 'Research' department.


Query :
{t.FNAME, t.LNAME, t.ADDRESS | EMPLOYEE(t) and (∃ d)(DEPARTMENT(d)
and d.DNAME='Research' and d.DNUMBER=t.DNO)}

The only free tuple variables in a relational calculus expression should be those that
appear to the left of the bar (| ). In above query, t is the only free variable; it is then bound
successively to each tuple. If a tuple satisfies the conditions specified in the query, the
attributes FNAME, LNAME, and
ADDRESS are retrieved for each such tuple.

The conditions EMPLOYEE (t) and DEPARTMENT(d) specify the range relations for t
and d. The condition d.DNAME =‘Research’ is a selection condition and corresponds to
a SELECT operation in the relational algebra, whereas the condition d.DNUMBER =
t.DNO is a JOIN condition.
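In SQL, the existentially quantified DEPARTMENT tuple becomes a join (COMPANY table names as in the examples):

SELECT e.FNAME, e.LNAME, e.ADDRESS
FROM   EMPLOYEE e, DEPARTMENT d
WHERE  d.DNUMBER = e.DNO
AND    d.DNAME = 'Research';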

Exclude from the universal quantification all tuples that we are not interested in
by making the condition true for all such tuples. The first tuples to exclude are those that
are not in the relation R
of interest.

In query above, using the expression not(PROJECT(x)) inside the universally quantified
formula evaluates to true all tuples x that are not in the PROJECT relation. Then we
exclude the tuples we are not interested in from R itself. The expression not(x.DNUM=5)
evaluates to true all tuples x that are in the project relation but are not controlled by
department 5.

Finally, we specify a condition that must hold on all the remaining tuples in R.
(( (w) and w.ESSN=e.SSN and x.PNUMBER=w.PNO)

Languages Based on Tuple Relational Calculus

The language SQL is based on tuple calculus. It uses the basic SELECT <list of
attributes>FROM <list of relations>WHERE <conditions> block structure to express the
queries in tuple calculus where the SELECT clause mentions the attributes being
projected, the FROM clause mentions the relations needed in the query, and the WHERE
clause mentions the selection as well as the join conditions.
ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
SQL syntax is expanded further to accommodate other operations. Another language
which is based on tuple calculus is QUEL which actually uses the range variables as in
tuple calculus.

Its syntax includes:

RANGE OF <variable name> IS <relation name>
Then it uses RETRIEVE <list of attributes from range variables> WHERE <conditions>.
This language was proposed in the relational DBMS INGRES.


The Domain Relational Calculus

Another variation of relational calculus called the domain relational calculus, or


simply, domain calculus is equivalent to tuple calculus and to relational algebra.

The language called QBE (Query-By-Example) that is related to domain calculus was
developed almost concurrently to SQL at IBM Research, Yorktown Heights, New
York. Domain calculus was thought of as a way to explain what QBE does.

Domain calculus differs from tuple calculus in the type of variables used in formulas:
rather than having variables range over tuples, the variables range over single values
from domains of attributes. To form a relation of degree n for a query result, we must
have n of these domain variables—one for each attribute.

An expression of the domain calculus is of the form {x1, x2, ..., xn |COND(x1, x2, ..., xn,
xn+1, xn+2, .., xn+m)} where x1, x2, .., xn, xn+1, xn+2, .., xn+m are domain variables
that range over domains and COND is a condition or formula of the domain relational
calculus.

Retrieve the birthdate and address of the employee whose name is ‘John B.Smith’.

Query :
{uv | (∃ q)(∃ r)(∃ s)(∃ t)(∃ w)(∃ x)(∃ y)(∃ z)
(EMPLOYEE(qrstuvwxyz) and q=’John’ and r=’B’ and s=’Smith’)}

Ten variables for the employee relation are needed, one to range over the
domain of each attribute in order. Of the ten variables q, r, s, .., z, only u and v are free.

Specify the requested attributes, BDATE and ADDRESS, by the free domain
variables u for BDATE and v for ADDRESS.

Specify the condition for selecting a tuple following the bar (|)—namely, that the
sequence of values assigned to the variables qrstuvwxyz be a tuple of the employee
relation and that the values for q(FNAME), r(MINIT), and s(LNAME) be ‘John’, ‘B’,
and ‘Smith’, respectively.

ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
3.2.6. QBE: A Query Language Based on Domain Calculus

This language is based on the idea of giving an example of a query using example
elements.

An example element stands for a domain variable and is specified as an example value
preceded by the underscore character.

P. (called Pdot) operator (for “print”) is placed in those columns which are requested for
the result of the query.


A user may initially start giving actual values as examples, but later can get used to
providing a minimum number of variables as example elements.

The language is very user-friendly, because it uses minimal syntax.

QBE was fully developed further with facilities for grouping, aggregation, updating etc.
and is shown to be equivalent to SQL.

The language is available under QMF (Query Management Facility) of DB2 of IBM and
has been used in various ways by other products like ACCESS of Microsoft, PARADOX.

QBE Examples

QBE initially presents a relational schema as a “blank schema” in which the user fills in
the query as an example:

The following domain calculus query can be successively minimized by the user as
shown:
Query :
{uv | (∃ q)(∃ r)(∃ s)(∃ t)(∃ w)(∃ x)(∃ y)(∃ z)
(EMPLOYEE(qrstuvwxyz) and q=’John’ and r=’B’ and s=’Smith’)}

Specifying complex conditions in QBE:


A technique called the “ condition box” is used in QBE to state more involved Boolean
expressions as conditions.

The D.4(a) gives employees who work on either project 1 or 2, whereas the query in
D.4(b) gives those who work on both the projects.

Illustrating join in QBE: the join is simply accomplished by using the same example
element in the columns being joined. Note that the Result is set up as an independent
table.


3.2.7 Using Selectivity and Cost estimates in Query Optimization

Optimisation is the process of choosing the most efficient way to execute a SQL
statement. The cost-based optimiser uses statistics to calculate the selectivity of
predicates and to estimate the cost of each execution plan.


You must gather statistics on a regular basis to provide the optimiser with information
about schema objects. New statistics should be gathered after a schema object's data or
structure are modified in ways that make the previous statistics inaccurate.

Statistics for Partitioned Schema Objects

Partitioned schema objects may contain multiple sets of statistics. They can have
statistics which refer to the entire schema object as a whole ( global statistics ), they can
have statistics which refer to an individual partition, and they can have statistics which
refer to an individual sub-partition of a composite partitioned object.
Unless the query predicate narrows the query to a single partition, the optimiser
uses the global statistics. Because most queries are not likely to be this restrictive,
it is most important to have accurate global statistics. Therefore, actually
gathering global statistics with the DBMS_STATS package is highly
recommended.

Using the DBMS_STATS Package

The PL/SQL package DBMS_STATS lets you generate and manage statistics for
cost-based optimization. For partitioned tables and indexes, DBMS_STATS can
gather separate statistics for each partition as well as global statistics for the entire
table or index.

One approach to gather statistics is using the GATHER_SCHEMA_STATS procedure. This procedure supports several features:
o Gather global statistics for tables and indexes for a whole schema.
o Gather statistics for stale statistics (see example below), for empty
statistics, or gather all statistics. Set OPTIONS to either GATHER
STALE, GATHER EMPTY, or just GATHER.
o Gather statistics for all levels. Set GRANULARITY to ALL and
CASCADE to TRUE.
o Gather statistics in parallel. Set DEGREE higher than the number
of CPUs.
 gather stale statistics (use with MONITORING option)

exec dbms_stats.gather_schema_stats( -
ownname => 'ABC', -
estimate_percent => 0.5, -
method_opt => 'FOR ALL COLUMNS SIZE 1', -
degree => 8, -
granularity => 'ALL', -
options => 'GATHER STALE', -
cascade => TRUE -
);


a. Unused Index

Assume you have a table with a few thousand or more records. Every record has a type field that indicates the type of the entry. The distribution of the type values is:

COUNT(*) TYPE
---------- ----------
94 0
3011 1

If you select all records of type 0, the optimiser should use the index on the type column for optimal performance. However, the optimiser decides to run a full table scan instead:

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=25 Card=1400
Bytes=64400)
1 0 SORT (ORDER BY) (Cost=25 Card=1400 Bytes=64400)
2 1 TABLE ACCESS (FULL) OF 'MAIL_SERVER' (Cost=3
Card=1400 Bytes=6...

Even if you re-calculate the global statistics after creating the index or after a data load, the optimiser does not use this index. What’s wrong with the optimiser?

b. Use Histograms

The cost-based optimiser uses data value histograms to get accurate estimates of
the distribution of column data. Histograms provide improved selectivity
estimates in the presence of data skew, resulting in optimal execution plans with
non-uniform data distributions.

Histograms can affect performance and should be used only when they
substantially improve query plans. They are useful only when they reflect the
current data distribution of a given column. If the data distribution of a column
changes frequently, you must re-compute its histogram frequently.
One approach to gather histogram statistics on specified tables or table columns is
using the GATHER_TABLE_STATS procedure in the same package:

 gather statistics with histograms

exec dbms_stats.gather_table_stats( -
ownname => 'ABC', -
tabname => 'MAIL_SERVER', -
method_opt => 'FOR COLUMNS SIZE 10 SERVER_TYPE', -
degree => 8 -
);

The same query for all records of type 0 will result in the following execution
plan:

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=94
Bytes=4606)
1 0 SORT (ORDER BY) (Cost=6 Card=94 Bytes=4606)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'MAIL_SERVER'
(Cost=3 Card=94...
3 2 INDEX (RANGE SCAN) OF 'IDX_MAISER_SERVER_TYPE'
(Cost=1 Card=94)

But if you reverse the query and select all records of type 1, the optimiser performs a full table scan, which is the optimal solution in this case:

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=53 Card=3011
Bytes=147539)
1 0 SORT (ORDER BY) (Cost=53 Card=3011 Bytes=147539)
2 1 TABLE ACCESS (FULL) OF 'MAIL_SERVER' (Cost=3
Card=3011 Bytes=...

c. Global Statistics vs. Histograms

The “price” of this approach (using histograms) is that you are losing global
statistics. This is true for Oracle 8i and should not be a limitation on Oracle 9i.
With Oracle 8i you have to decide whether you need global statistics or
histograms on the same table or index. Alternative solutions might be:

o Switch the optimiser: The hint /*+ RULE */ would switch from
cost-based to rule-based optimiser. And the rule-based optimiser
assumes that using an index is the best solution.
o Use an index hint: The hint /*+ INDEX(<table name> <index
name>) */ would lead the cost-based optimiser to make use of the
given index.
o Migrate to Oracle 9i: Maybe you have the chance or one more
argument to migrate!


The package DBMS_STATS can be used to gather global statistics. Please note
that Oracle 8i currently does not gather global histogram statistics. It is most
important to have accurate global statistics for partitioned schema objects.

Histograms can affect performance and should be used only when they
substantially improve query plans. But Oracle 8i does not support global statistics
and histograms on the same objects. The database designer has to decide how to
go around this limitation. Possible alternative solutions are optimiser hints.

3.2.8. Query optimization in Oracle

Query Optimisation: Query Execution Algorithms, Heuristics in Query Execution, Cost Estimation in Query Execution, Semantic Query Optimisation. Database Transactions
and Recovery Procedures: Transaction Processing Concepts, Transaction and System
Concepts, Desirable Properties of a Transaction, Schedules and Recoverability,
Serializability of Schedules, Transaction Support in SQL, Recovery Techniques,
Database Backup, Concurrency control, Locking techniques for Concurrency Control,
Concurrency Control Techniques, Granularity of Data Items.

Semantic Query Optimization

We present a technique for semantic query optimization (SQO) for object databases. We
use the ODMG-93 standard ODL and OQL languages. The ODL object schema and the
OQL object query are translated into a DATALOG representation. Semantic knowledge
about the object model and the particular application is expressed as integrity constraints.
This is an extension of the ODMG-93 standard. SQO is performed in the DATALOG
representation and an equivalent logic query, and subsequently an equivalent OQL

Speeding up a system's performance is one of the major goals of machine learning. Explanation-based learning is typically used for speedup learning, while applications of inductive learning are usually limited to data classifiers. One approach uses inductively learned knowledge for semantic query optimization to speed up query answering in data/knowledge-based systems.

The principle of semantic query optimization (King, 1981) is to use semantic rules, such as "all Tunisian seaports have railroad access", to reformulate a query into a less expensive but equivalent query, so as to reduce the query evaluation cost. For example, suppose we have a query to find all Tunisian seaports with railroad access and 2,000,000 ft³ of storage space. From the rule given above, we can reformulate the query so that there is no need to check the railroad access of seaports, which may save some execution time.
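A minimal sketch of this kind of rewrite, assuming the query is represented as a set of attribute/value predicates and the semantic rule as an implication (all names and values below are illustrative):

# Illustrative rule: country = 'Tunisia' and type = 'seaport'
# implies railroad_access = True.
rule_premise = {('country', 'Tunisia'), ('type', 'seaport')}
rule_implied = ('railroad_access', True)

query_predicates = {('country', 'Tunisia'), ('type', 'seaport'),
                    ('railroad_access', True), ('min_storage_ft3', 2_000_000)}

def semantic_rewrite(predicates):
    """Drop a predicate that is implied by the integrity constraint."""
    if rule_premise <= predicates and rule_implied in predicates:
        return predicates - {rule_implied}   # the redundant check is removed
    return predicates

print(semantic_rewrite(query_predicates))
# The railroad_access predicate disappears, giving an equivalent but cheaper query.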

Two queries are semantically equivalent if they return the same answer for
any database state satisfying a given set of integrity constraints.

A semantic transformation transforms a given query into a semantically equivalent one.


Semantic query optimization is the process of determining the set of semantic transformations that results in a semantically equivalent query with a lower execution cost.

ODB-QOptimizer determines more specialized classes to be accessed and reduces the number of factors by applying the Integrity Constraint Rules.

Transaction processing concepts

Introduction to Transaction Processing

Single-User System: At most one user at a time can use the system.

Multiuser System: Many users can access the system concurrently.

Interleaved processing: concurrent execution of processes is interleaved in a single CPU.

Parallel processing: processes are concurrently executed in multiple CPUs.

Transaction: a logical unit of database processing that includes one or more access operations (read – retrieval; write – insert or update; delete).

A transaction (set of operations) may be specified stand-alone in a high-level language such as SQL and submitted interactively, or may be embedded within a program.

Transaction boundaries: Begin and End transaction.

An application program may contain several transactions separated by the Begin and
End transaction boundaries.

SIMPLE MODEL OF A DATABASE (for purposes of discussing transactions):


database – a collection of named data items

Granularity of data – a field, a record, or a whole disk block

Basic operations are read and write

read_item(X): Reads a database item named X into a program variable. To simplify our
notation, we assume that the program variable is also named X.

write_item(X): Writes the value of program variable X into the database item named X.


a. READ AND WRITE OPERATIONS:

Basic unit of data transfer from the disk to the computer main memory is one block. In
general, a data item (what is read or written) will be the field of some record in the
database, although it may be a larger unit such as a record or even a whole block.

read_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.


2. Copy that disk block into a buffer in main memory ( if that disk block is not already in
some main memory buffer).
3. Copy item X from the buffer to the program variable named X.

write_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.


2. Copy that disk block into a buffer in main memory (if that disk block is not already in
some main memory buffer).
3. Copy item X from the program variable named X into its correct location in the buffer.
4. Store the updated block from the buffer back to disk ( either immediately or at some
later point in time).
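As a sketch, the read_item and write_item steps above can be mimicked over a toy block-structured disk and an in-memory buffer pool; every name and value below is an illustrative assumption.

# DISK maps a block id to the items it holds; BUFFER is the buffer pool in main memory.
DISK = {0: {'X': 100, 'Y': 50}}
BUFFER = {}                        # block id -> in-memory copy of the block
BLOCK_OF = {'X': 0, 'Y': 0}        # step 1: find the block that contains the item

def read_item(name):
    block = BLOCK_OF[name]
    if block not in BUFFER:                    # step 2: bring the block into a buffer
        BUFFER[block] = dict(DISK[block])
    return BUFFER[block][name]                 # step 3: copy the item to the program variable

def write_item(name, value, flush=True):
    block = BLOCK_OF[name]
    if block not in BUFFER:                    # step 2
        BUFFER[block] = dict(DISK[block])
    BUFFER[block][name] = value                # step 3: update the item in the buffer
    if flush:                                  # step 4: store the block back (now or later)
        DISK[block] = dict(BUFFER[block])

X = read_item('X')
write_item('X', X + 10)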

Two sample transactions. (a) Transaction T1. (b) Transaction T2.

Why Concurrency Control is needed:

b. The Lost Update Problem.

This occurs when two transactions that access the same database items have their
operations interleaved in a way that makes the value of some database item incorrect.
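A minimal numeric illustration of such an interleaving (hypothetical values; T1 and T2 both add to the same item X):

X_db = 100                # X starts at 100; serially, +10 then +20 should give 130

t1_local = X_db           # T1: read_item(X)
t2_local = X_db           # T2: read_item(X)  -- reads before T1 writes back
t1_local += 10            # T1: X := X + 10
t2_local += 20            # T2: X := X + 20
X_db = t1_local           # T1: write_item(X)
X_db = t2_local           # T2: write_item(X) -- overwrites T1's update

print(X_db)               # 120: T1's update has been lost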

c. The Temporary Update (or Dirty Read) Problem.

This occurs when one transaction updates a database item and then the transaction fails
for some reason. The updated item is accessed by another transaction before it is changed
back to its original value.


d. The Incorrect Summary Problem

If one transaction is calculating an aggregate summary function on a number of records


while other transactions are updating some of these records, the aggregate function may
calculate some values before they are updated and others after they are updated.

Some problems that occur when concurrent execution is uncontrolled: (a) the lost update problem; (b) the temporary update problem; (c) the incorrect summary problem.

e. Why recovery is needed

(What causes a Transaction to fail)

1. A computer failure (system crash): A hardware or software error occurs in the


computer system during transaction execution. If the hardware crashes, the contents of
the computer’s internal memory may be lost.

2. A transaction or system error : Some operation in the transaction may cause it to


fail, such as integer overflow or division by zero. Transaction failure may also occur
because of erroneous parameter values or because of a logical programming error. In
addition, the user may interrupt the transaction during its execution.

3. Local errors or exception conditions detected by the transaction: - certain conditions


necessitate cancellation of the transaction. For example, data for the transaction may
not be found. A condition, such as insufficient account balance in a banking database,
may cause a transaction, such as a fund withdrawal from that account, to be canceled.
- a programmed abort in the transaction causes it to fail.

4. Concurrency control enforcement: The concurrency control method may decide to


abort the transaction, to be restarted later, because it violates serializability or because
several transactions are in a state of deadlock
5. Disk failure: Some disk blocks may lose their data because of a read or write
malfunction or because of a disk read/write head crash. This may happen during
a read or a write operation of the transaction.

6. Physical problems and catastrophes: This refers to an endless list of problems that
includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes
by mistake, and mounting of a wrong tape by the operator.


Transaction and System Concepts

A transaction is an atomic unit of work that is either completed in its entirety or not
done at all. For recovery purposes, the system needs to keep track of when the transaction
starts, terminates, and commits or aborts.

Transaction states:
Active state
Partially committed state
Committed state
Failed state
Terminated State

Recovery manager keeps track of the following operations:

begin_transaction: This marks the beginning of transaction execution.

read or write: These specify read or write operations on the database items that are
executed as part of a transaction.

end_transaction: This specifies that read and write transaction operations have ended
and marks the end limit of transaction execution. At this point it may be necessary to
check whether the changes introduced by the transaction can be permanently applied to
the database or whether the transaction has to be aborted because it violates concurrency
control or for some other reason.

commit_transaction: This signals a successful end of the transaction so that any changes
( updates) executed by the transaction can be safely committed to the database and will
not be undone.

rollback (or abort): This signals that the transaction has ended unsuccessfully, so that
any changes or effects that the transaction may have applied to the database must be
undone.

Recovery techniques use the following operators:

undo: Similar to rollback except that it applies to a single operation rather than to a
whole transaction.

redo: This specifies that certain transaction operations must be redone to ensure that all
the operations of a committed transaction have been applied successfully to the database.

State transition diagram illustrating the states for transaction execution.


The System Log

Log or Journal :
The log keeps track of all transaction operations that affect the values of database items.
This information may be needed to permit recovery from transaction failures. The log is
kept on disk, so it is not affected by any type of failure except for disk or catastrophic
failure. In addition, the log is periodically backed up to archival storage (tape) to guard
against such catastrophic failures.
Each transaction has a transaction-id, generated automatically by the system, which is used to identify the transaction's records in the log.

Types of log record

1. [start_transaction,T]: Records that transaction T has started execution.


2. [write_item,T,X,old_value,new_value]: Records that transaction T has changed the
value of database item X from old_value to new_value.
3. [read_item,T,X]: Records that transaction T has read the value of database item X.
4. [commit,T]: Records that transaction T has completed successfully, and affirms that its
effect can be committed (recorded permanently) to the database.
5. [abort,T]: Records that transaction T has been aborted.

Protocols for recovery that avoid cascading rollbacks do not require that read operations be written to the system log, whereas other protocols require these entries for recovery. Strict protocols require simpler write entries that do not include new_value.

Recovery using log records

If the system crashes, we can recover to a consistent database state by examining the log
and using one of the techniques
1. Because the log contains a record of every write operation that changes the value of
some database item, it is possible to undo the effect of these write operations of a
transaction T by tracing backward through the log and resetting all items changed by a
write operation of T to their old_values.
2. We can also redo the effect of the write operations of a transaction T by tracing
forward through the log and setting all items changed by a write operation of T to their
new_values.
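A minimal sketch of these two techniques over a toy list of log records; the record layout below is an illustrative simplification of the entries described above.

# Each write record is ('write', T, X, old_value, new_value); commits are ('commit', T).
log = [
    ('start', 'T1'), ('write', 'T1', 'X', 100, 80), ('commit', 'T1'),
    ('start', 'T2'), ('write', 'T2', 'Y', 50, 70),       # T2 never committed
]
db = {'X': 80, 'Y': 70}            # database state on disk at the time of the crash

committed = {rec[1] for rec in log if rec[0] == 'commit'}

# Undo: trace backward, resetting items changed by uncommitted transactions to old_value.
for rec in reversed(log):
    if rec[0] == 'write' and rec[1] not in committed:
        _, txn, item, old_value, new_value = rec
        db[item] = old_value

# Redo: trace forward, setting items changed by committed transactions to new_value.
for rec in log:
    if rec[0] == 'write' and rec[1] in committed:
        _, txn, item, old_value, new_value = rec
        db[item] = new_value

print(db)   # {'X': 80, 'Y': 50} -- the uncommitted write of T2 has been undone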


3.2.9. Desirable properties of Transaction

Commit Point of a Transaction:

Definition: A transaction T reaches its commit point when all its operations that access
the database have been executed successfully and the effect of all the transaction
operations on the database has been recorded in the log. Beyond the commit point, the
transaction is said to be committed, and its effect is assumed to be permanently recorded
in the database. The transaction then writes an entry [ commit,T] into the log.

Roll Back of transactions: Needed for transactions that have a [ start_transaction,T]


entry into the log but no commit entry [ commit,T] into the log.

Redoing transactions: Transactions that have written their commit entry in the log must
also have recorded all their write operations in the log; otherwise they would not be
committed, so their effect on the database can be redone from the log entries.
( Notice that the log file must be kept on disk. At the time of a system crash, only the log
entries that have been written back to disk are considered in the recovery process because
the contents of main memory may be lost.)

Force writing a log: before a transaction reaches its commit point, any portion of the log that has not yet been written to the disk must be written to the disk. This process is called force-writing the log file before committing a transaction.

a. ACID properties:

Atomicity: A transaction is an atomic unit of processing; it is either performed in its


entirety or not performed at all.

Consistency preservation: A correct execution of the transaction must take the database
from one consistent state to another.

Isolation: A transaction should not make its updates visible to other transactions until it
is committed; this property, when enforced strictly, solves the temporary update problem
and makes cascading rollbacks of transactions unnecessary.

Durability or permanency: Once a transaction changes the database and the changes are committed, these changes must never be lost because of subsequent failures.

Schedules and Recoverability

Transaction schedule or history: When transactions are executing concurrently in an interleaved fashion, the order of execution of operations from the various transactions forms what is known as a transaction schedule.

A schedule ( or history) S of n transactions T1, T2, ..., Tn :


It is an ordering of the operations of the transactions subject to the constraint that, for each transaction Ti that participates in S, the operations of Ti in S must appear in the same order in which they occur in Ti. Note, however, that operations from other transactions Tj can be interleaved with the operations of Ti in S.

Schedules classified on recoverability:

Recoverable schedule: One where no committed transaction ever needs to be rolled back. A schedule S is recoverable if no transaction T in S commits until all transactions T’ that have written an item that T reads have committed.

Cascadeless schedule: One where every transaction reads only the items that are written
by committed transactions.

Schedules requiring cascaded rollback: A schedule in which uncommitted transactions that read an item from a failed transaction must be rolled back.

Strict schedules: A schedule in which a transaction can neither read nor write an item X until the last transaction that wrote X has committed.

Serializability of Schedules

Characterizing Schedules based on Serializability

Serial schedule: A schedule S is serial if, for every transaction T participating in the
schedule, all the operations of T are executed consecutively in the schedule. Otherwise,
the schedule is called nonserial schedule.

Serializable schedule: A schedule S is serializable if it is equivalent to some serial schedule of the same n transactions.

Result equivalent: Two schedules are called result equivalent if they produce the same
final state of the database.

Conflict equivalent: Two schedules are said to be conflict equivalent if the order of any two conflicting operations is the same in both schedules.

Conflict serializable: A schedule S is said to be conflict serializable if it is conflict equivalent to some serial schedule S’.

Being serializable is not the same as being serial. Being serializable implies that the
schedule is a correct schedule. It will leave the database in a consistent state. The
interleaving is appropriate and will result in a state as if the transactions were serially
executed, yet will achieve efficiency due to concurrent execution.


Serializability is hard to check. Interleaving of operations occurs in an operating system through some scheduler, and it is difficult to determine beforehand how the operations in a schedule will be interleaved.

Practical approach
Come up with methods (protocols) to ensure serializability. It is not possible to determine when a schedule begins and when it ends. Hence, we reduce the problem of checking the whole schedule to checking only a committed projection of the schedule.

Use of locks with two phase locking

View equivalence: A less restrictive definition of equivalence of schedules

View serializability: a definition of serializability based on view equivalence. A schedule is view serializable if it is view equivalent to a serial schedule.

Two schedules are said to be view equivalent if the following three conditions hold:

1. The same set of transactions participates in S and S’, and S and S’ include the same
operations of those transactions.
2. For any operation Ri(X) of Ti in S, if the value of X read by the operation has been
written by an operation Wj(X) of Tj (or if it is the original value of X before the schedule
started), the same condition must hold for the value of X read by operation Ri(X) of Ti in
S’.
3. If the operation Wk(Y) of Tk is the last operation to write item Y in S, then Wk(Y) of
Tk must also be the last operation to write item Y in S’.

The premise behind view equivalence:

As long as each read operation of a transaction reads the result of the same write operation in both schedules, the write operations of each transaction must produce the same results.

“The view”: the read operations are said to see the same view in both schedules.

Relationship between view and conflict equivalence:


The two are the same under the constrained write assumption, which assumes that if T writes X, it is constrained by the value of X it read; i.e., new X = f(old X).

Conflict serializability is stricter than view serializability. With unconstrained writes (or blind writes), a schedule that is view serializable is not necessarily conflict serializable.

Any conflict serializable schedule is also view serializable, but not vice versa.


Consider the following schedule of three transactions T1: r1(X), w1(X); T2: w2(X); and
T3: w3(X):

Schedule Sa: r1(X); w2(X); w1(X); w3(X); c1; c2; c3;


In Sa, the operations w2(X) and w3(X) are blind writes, since T2 and T3 do not read the value of X.

Sa is view serializable, since it is view equivalent to the serial schedule T1, T2, T3.
However, Sa is not conflict serializable, since it is not conflict equivalent to any serial
schedule.

Testing for conflict serializability Algorithm

1. Looks at only read_Item (X) and write_Item (X) operations


2. Constructs a precedence graph (serialization graph) - a graph
with directed edges
3. An edge is created from Ti to Tj if one of the operations in Ti
appears before a conflicting operation in Tj
4. The schedule is serializable if and only if the precedence graph has no cycles.
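A minimal sketch of this test, assuming a schedule is represented as a list of (transaction, operation, item) triples with 'r' for read_item and 'w' for write_item (the sample schedule is illustrative, not one of the schedules A to D below):

from itertools import combinations

def precedence_edges(schedule):
    """Create an edge Ti -> Tj for each pair of conflicting operations (Ti before Tj)."""
    edges = set()
    for (ti, op1, x1), (tj, op2, x2) in combinations(schedule, 2):
        if ti != tj and x1 == x2 and (op1 == 'w' or op2 == 'w'):
            edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(node, target, seen=()):
        return any(nxt == target or
                   (nxt not in seen and reachable(nxt, target, seen + (nxt,)))
                   for nxt in graph.get(node, ()))
    return any(reachable(t, t) for t in graph)

schedule = [('T1', 'r', 'X'), ('T2', 'w', 'X'), ('T1', 'w', 'X')]
edges = precedence_edges(schedule)
print(edges, '->', 'serializable' if not has_cycle(edges) else 'not serializable')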

Constructing the precedence graphs for schedules A to D to test for conflict serializability: (a) precedence graph for serial schedule A; (b) precedence graph for serial schedule B; (c) precedence graph for schedule C (not serializable); (d) precedence graph for schedule D (serializable, equivalent to schedule A).


Other Types of Equivalence of Schedules

Under special semantic constraints, schedules that are otherwise not conflict serializable may work correctly. Using commutative operations of addition and subtraction, certain non-serializable transactions may work correctly.

Example: bank credit / debit transactions on a given item are separable and
commutative.

Consider the following schedule S for the two transactions:

Sh : r1(X); w1(X); r2(Y); w2(Y); r1(Y); w1(Y); r2(X); w2(X);

Using conflict serializability, it is not serializable. However, if it came from a (read, update, write) sequence as follows:

r1(X); X := X – 10; w1(X); r2(Y); Y := Y – 20; w2(Y); r1(Y); Y := Y + 10; w1(Y); r2(X); X := X + 20; w2(X);

Sequence explanation: debit, debit, credit, credit.

It is a correct schedule for the given semantics

3.2.10 Transaction support in SQL

A single SQL statement is always considered to be atomic. Either the statement completes execution without error or it fails and leaves the database unchanged.

With SQL, there is no explicit Begin Transaction statement. Transaction initiation is done
implicitly when particular SQL statements are encountered.

Every transaction must have an explicit end statement, which is either a COMMIT or
ROLLBACK.

Characteristics specified by a SET

Access mode: READ ONLY or READ WRITE. The default is READ WRITE unless the isolation level of READ UNCOMMITTED is specified, in which case READ ONLY is assumed.

Diagnostic size n, specifies an integer value n, indicating the number of conditions that
can be held simultaneously in the diagnostic area.
Characteristics specified by a SET

Isolation level <isolation>, where <isolation> can be READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ or SERIALIZABLE. The default is SERIALIZABLE. With SERIALIZABLE, the interleaved execution of transactions will adhere to our notion of serializability. However, if any transaction executes at a lower level, then serializability may be violated.


Potential problem with lower isolation levels:

Dirty Read: Reading a value that was written by a transaction which failed.

Nonrepeatable Read: Allowing another transaction to write a new value between multiple reads of one transaction. A transaction T1 may read a given value from a table. If another transaction T2 later updates that value and T1 reads that value again, T1 will see a different value. Consider that T1 reads the employee salary for Smith. Next, T2 updates the salary for Smith. If T1 reads Smith's salary again, then it will see a different value for Smith's salary.

Phantoms: New rows being read using the same read with a condition. A transaction T1
may read a set of rows from a table, perhaps based on some condition specified in the
SQL WHERE clause. Now suppose that a transaction T2 inserts a new row that also
satisfies the WHERE clause condition of T1, into the table used by T1. If T1 is repeated,
then T1 will see a row that previously did not exist, called a phantom.

Sample SQL transaction:

EXEC SQL whenever sqlerror go to UNDO;


EXEC SQL SET TRANSACTION
READ WRITE
DIAGNOSTICS SIZE 5
ISOLATION LEVEL SERIALIZABLE;
EXEC SQL INSERT
INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY)
VALUES ('Robert','Smith','991004321',2,35000);
EXEC SQL UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO = 2;
EXEC SQL COMMIT;
GOTO THE_END;
UNDO: EXEC SQL ROLLBACK;
THE_END: ...

Possible violations of serializability by isolation level:

Isolation level       Dirty read   Nonrepeatable read   Phantom
READ UNCOMMITTED      yes          yes                  yes
READ COMMITTED        no           yes                  yes
REPEATABLE READ       no           no                   yes
SERIALIZABLE          no           no                   no


3.3 Revision points

 Data Model: A set of concepts to describe the structure of a database, and certain
constraints that the database should obey.
 Data Model Operations: Operations for specifying database retrievals and
updates by referring to the concepts of the data model. Operations on the data
model may include basic operations and user-defined operations.
 Integrated data means that the database may be thought of as a unification of
several otherwise distinct data files, with any redundancy among those files either
wholly or partly eliminated.
 An entity is any distinguishable real world object that is to be represented in the
database; each entity will have attributes or properties.
 The metadata is information about schema objects, such as tables, indexes,
views, triggers, and more.
 Sorting is one of the primary algorithms used in Query processing.
 Access to the data dictionary is allowed through numerous views, which are
divided into three categories: USER, ALL, and DBA.
 ACID properties are Atomicity, Consistency preservation, Isolation, and Durability (permanency).

3.4 Intext Questions


1. Explain the importance of the Data Dictionary.
2. Write the basic algorithm for executing a query operation.
3. What do you mean by query optimization?
4. Elucidate the System Catalog.
5. How are queries translated into Relational Algebra?

3.5 Summary
 The system catalog contains information about all three levels of database
schemas: external (view definitions), conceptual (base tables), and internal
(storage and index descriptions).
 SQL objects (i.e., tables, views, ...) are contained in schemas. Schemas are contained in catalogs. Each schema has a single owner. Objects can be referenced with explicit or implicit catalog and schema names.
 Oracle's data dictionary views are mapped onto underlying base tables, but the views form the primary interface to Oracle's metadata. Unless you have specific reasons to bypass the views and query the underlying base tables directly, you should use the views. The views return data in a much more understandable format than you'll get from querying the underlying tables. In addition, the views make up the interface that Oracle documents and supports. Using an undocumented interface, i.e. the base tables, is a risky practice.


 A sequence of relational algebra operations forms a relational algebra expression, whose result will also be a relation that represents the result of a database query.
 The ODMG-93 standard ODL and OQL languages are used in a technique for semantic query optimization (SQO) for object databases.
 Semantic query optimization is the process of determining the set of semantic
transformations that results in a semantically equivalent query with a lower
execution cost.
 A transaction is an atomic unit of work that is either completed in its entirety or
not done at all. For recovery purposes, the system needs to keep track of when the
transaction starts, terminates, and commits or aborts.

3.6 Terminal Exercise

1. List the ACID Properties.


2. What are the Basic Algorithms in Query Processing?
3. A _________ is an atomic unit of work that is either completed in its entirety or
not done at all.
4. A sequence of relational algebra operations forms a __________ expression,
whose result will also be a relation that represents the result of a database query
5. What is Semantic Query Optimization?

3.7 Supplementary Materials

[Denn87b] Denning, Dorothy E. et al., “A Multilevel Relational Data Model”. In Proceedings IEEE Symposium on Security and Privacy, pp. 220–234, 1987.

[Haig91] Haigh, J. T. et al., “The LDV Secure Relational DBMS Model,” In Database
Security, IV: Status and Prospects, S. Jajodia and C.E. Landwehr eds., pp. 265-269,
North Holland: Elsevier, 1991.

3.8 Assignment

Prepare an assignment on the properties of transactions.

3.9 Reference Books
Bloesch, A. and Halpin, T. (1996) “ConQuer: a Conceptual Query Language”
Proc.ER’96: 15th International Conference on Conceptual Modeling, Springer LNCS,
no. 1157.
Bloesch, A. and Halpin, T. (1997) “Conceptual Queries Using ConQuer-II” in. David W.
Embley, Robert C. Goldstein (Eds.): Conceptual Modeling - ER '97, 16th International
Conference on Conceptual Modeling, Los Angeles, California, USA, November 3-5,
1997, Proceedings. Lecture Notes in Computer Science 1331 Springer 1997
Elmasri, R. & Navathe, S. B. (2000). Fundamentals of Database Systems. (3rd ed.).


3.10 Learning Activities

Individuals or groups of students may visit the library for further study activities.

3.11 Keywords
1. Data Model
2. Entity
3. Network Model
4. Data Dictionary
5. Metadata
6. Relational Algebra
7. ACID Properties
8. Semantic Query Optimization


UNIT – IV

Topics:
 Concurrency Control Techniques
 Locking Techniques for Concurrency Control
 Concurrency Control Based on Timestamp Ordering
 Validation Concurrency Control Techniques
 Granularity of Data Items and Multiple Granularity Locking
 Using Locks for Concurrency Control In Indexes
 Database Recovery Techniques: Recovery Concepts
 Recovery Techniques Based on Deferred Update / Immediate Update / Shadow Paging
 The ARIES Recovery Algorithm
 Database Security and Authorization

4.0 Introduction
Concurrency control enforces isolation among conflicting transactions in a database management system. It preserves the integrity of individual data items and keeps the database consistent under concurrent access, which improves the reliability of the system.

4.1 Objective
The objective of this unit is to learn and understand concurrency control techniques based on locking, timestamps and validation, granularity of data items and multiple granularity locking, database recovery techniques, and database security and authorization.

4.2 Contents

4.2.1 Concurrency Control Techniques: Locking Techniques for Concurrency Control

Purpose of Concurrency Control

• To enforce Isolation (through mutual exclusion) among conflicting transactions.


• To preserve database consistency through consistency preserving execution of
transactions.
• To resolve read-write and write-write conflicts.

Example: In a concurrent execution environment, if T1 conflicts with T2 over a data item A, then the concurrency control mechanism decides whether T1 or T2 gets A and whether the other transaction is rolled back or waits.


Two-Phase Locking Techniques

Locking is an operation which secures (a) permission to Read or (b) permission to Write
a data item for a transaction.

Example: Lock (X). Data item X is locked in behalf of the requesting transaction.

Unlocking is an operation which removes these permissions from the data item. Example:
Unlock (X).

Data item X is made available to all other transactions. Lock and Unlock are Atomic
operations.

Two-Phase Locking Techniques: Essential components

Two locks modes (a) shared (read) and (b) exclusive (write).

Shared mode: shared lock (X). More than one transaction can apply share lock on X for
reading its value but no write lock can be applied on X by any other transaction.

Exclusive mode: Write lock (X). Only one write lock on X can exist at any time and no
shared lock can be applied by any other transaction on X.

Lock Manager: Managing locks on data items.

Lock table: the lock manager uses it to store the identity of the transaction locking a data item, the data item, the lock mode, and a pointer to the next data item locked. One simple way to implement a lock table is through a linked list.

The database requires that all transactions be well-formed.

A transaction is well-formed if:

• It locks a data item before it reads or writes to it.
• It does not lock an already locked data item and does not try to unlock a free data item.


The following code performs the lock operation:

B: if LOCK(X) = 0            (* item is unlocked *)
   then LOCK(X) <- 1         (* lock the item *)
   else begin
        wait (until LOCK(X) = 0 and the lock manager wakes up the transaction);
        goto B
        end;

The following code performs the unlock operation:

LOCK(X) <- 0                 (* unlock the item *)
if any transactions are waiting then wake up one of the waiting transactions;

The following code performs the read / write operation:
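A minimal sketch of shared (read) and exclusive (write) locking over a simple in-memory lock table, in the spirit of the operations above (single-threaded and illustrative; a real lock manager would block and queue waiting transactions instead of returning False):

# LOCK[X] is None (unlocked), ('read', {transactions}) or ('write', transaction).
LOCK = {}

def read_lock(X, txn):
    """Grant a shared lock unless X is write-locked by another transaction."""
    mode = LOCK.get(X)
    if mode is None:
        LOCK[X] = ('read', {txn})
        return True
    if mode[0] == 'read':
        mode[1].add(txn)            # shared locks are compatible with each other
        return True
    return False                    # write-locked: the caller would have to wait

def write_lock(X, txn):
    """Grant an exclusive lock only if X is free or already write-locked by txn."""
    mode = LOCK.get(X)
    if mode is None or mode == ('write', txn):
        LOCK[X] = ('write', txn)
        return True
    return False

def unlock(X, txn):
    mode = LOCK.get(X)
    if mode and mode[0] == 'read':
        mode[1].discard(txn)
        if not mode[1]:
            LOCK[X] = None
    elif mode == ('write', txn):
        LOCK[X] = None

print(read_lock('X', 'T1'), read_lock('X', 'T2'), write_lock('X', 'T3'))
# True True False -- T3 must wait until both shared locks are released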


Two-Phase Locking Techniques: The algorithm


Two Phases: (a) Locking (Growing) (b) Unlocking (Shrinking).

Locking (Growing) Phase: A transaction applies locks (read or write) on desired data
items one at a time.

Unlocking (Shrinking) Phase: A transaction unlocks its locked data items one at a time.

Requirement: For a transaction these two phases must be mutually exclusively, that is,
during locking phase unlocking phase must not start and during unlocking phase locking
phase must not begin.

To guarantee serializability, in a transaction all lock operations (S_Lock or X_Lock) must precede the first unlock operation. No locks can be acquired after the first lock is released. A transaction is said to satisfy the two-phase locking protocol if it obeys this rule. The two-phase execution involves a Growing Phase – lock acquisition only (no unlocks) – and a Shrinking Phase – lock release only (no further locks). A lock point divides the two phases.


Two-phase policy generates two locking algorithms (a) Basic and (b) Conservative.

Conservative: Prevents deadlock by locking all desired data items before transaction
begins execution.

Basic: The transaction locks data items incrementally. This may cause deadlock, which must then be dealt with.
Strict: A stricter version of the Basic algorithm, where unlocking is performed only after a transaction terminates (commits, or aborts and is rolled back). This is the most commonly used two-phase locking algorithm.

Dealing with Deadlock and Starvation

Deadlock example: T1' and T2' both follow the two-phase policy, yet they deadlock.

T1'                              T2'
read_lock(Y);
read_item(Y);
                                 read_lock(X);
                                 read_item(X);
write_lock(X);  (waits for X)
                                 write_lock(Y);  (waits for Y)

Deadlock (T1' and T2')

Deadlock prevention

A transaction locks all data items it refers to before it begins execution. This way of
locking prevents deadlock since a transaction never waits for a data item. The
conservative two-phase locking uses this approach.
Deadlock occurs when each of two transactions is waiting for the other to release the lock
on an item.

In general a deadlock may involve n (n>2) transactions, and can be detected by using a
wait-for graph.

Deadlock detection and resolution

In this approach, deadlocks are allowed to happen. The scheduler maintains a wait-for graph for detecting cycles. If a cycle exists, then one transaction involved in the cycle is selected (the victim) and rolled back.

A wait-for graph is created using the lock table. As soon as a transaction is blocked, it is added to the graph. When a chain like Ti waits for Tj, Tj waits for Tk, and Tk waits for Ti or Tj occurs, this creates a cycle. One of the transactions in the cycle is selected and rolled back.


Deadlock avoidance

There are many variations of the two-phase locking algorithm. Some avoid deadlock by not letting a cycle complete: as soon as the algorithm discovers that blocking a transaction is likely to create a cycle, it rolls back the transaction. The Wound-Wait and Wait-Die algorithms use timestamps to avoid deadlocks by rolling back a victim. For example:
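The two rules can be sketched as follows, assuming each transaction carries the timestamp of its start (a smaller timestamp means an older transaction); the returned decision strings are illustrative:

TS = {'T1': 5, 'T2': 9}      # T1 is older than T2

def wait_die(requester, holder):
    """Wait-Die: an older requester may wait; a younger requester dies (aborts and restarts)."""
    return 'wait' if TS[requester] < TS[holder] else 'abort requester'

def wound_wait(requester, holder):
    """Wound-Wait: an older requester wounds (aborts) the holder; a younger requester waits."""
    return 'abort holder' if TS[requester] < TS[holder] else 'wait'

print(wait_die('T2', 'T1'))     # younger T2 asks for a lock held by T1 -> abort requester
print(wound_wait('T1', 'T2'))   # older T1 asks for a lock held by T2  -> abort holder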

A deadlock can be seen as a cycle in the wait-for graph. A deadlock is broken by rolling back any one of the transactions causing it.

Dealing with deadlocks involves:

1. Deadlock detection and resolution, which comprises the following steps:
a. Construct the wait-for graph.
b. Periodically check for deadlocks using graph algorithms, based on the waiting time of transactions and the number of concurrent transactions.
c. When a deadlock occurs, select a victim and abort it, while watching out for starvation.

2. Deadlock avoidance, which can be done in two ways:
a. Acquiring all the locks at once (less concurrency).
b. Acquiring the locks in a pre-fixed order (cannot go back).

Starvation
Starvation occurs when a particular transaction consistently waits or is restarted and never gets a chance to proceed further. In deadlock resolution it is possible that the same
transaction may consistently be selected as victim and rolled-back. This limitation is
inherent in all priority based scheduling mechanisms. In Wound-Wait scheme a
younger transaction may always be wounded (aborted) by a long running older
transaction which may create starvation.

Deadlock Avoidance Strategies

These include:
1. No waiting: if a lock cannot be granted, abort and restart the transaction immediately, without waiting to see whether a deadlock will occur.
2. Cautious waiting: wait for the lock only if the transaction holding it is not itself blocked; otherwise abort.
3. Based on timeouts: long waits are assumed to indicate a deadlock, and the waiting transaction is aborted.

4.2.2 Concurrency control based on Timestamp Ordering


Timestamp
A monotonically increasing variable (integer) indicating the age of an operation or a
transaction. A larger timestamp value indicates a more recent event or operation.


A timestamp-based algorithm uses timestamps to serialize the execution of concurrent transactions.

Basic Timestamp Ordering

1. Transaction T issues a write_item(X) operation:

a. If read_TS(X) > TS(T) or write_TS(X) > TS(T), then a younger transaction has already read or written the data item, so abort and roll back T and reject the operation.

b. If the condition in part (a) does not hold, then execute write_item(X) of T and set write_TS(X) to TS(T).

2. Transaction T issues a read_item(X) operation:

a. If write_TS(X) > TS(T), then a younger transaction has already written to the data item, so abort and roll back T and reject the operation.

b. If write_TS(X) <= TS(T), then execute read_item(X) of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
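A minimal sketch of these rules, keeping read_TS and write_TS in dictionaries (illustrative only; the restart of aborted transactions is omitted):

read_TS, write_TS = {}, {}

def to_write(TS_T, X):
    if read_TS.get(X, 0) > TS_T or write_TS.get(X, 0) > TS_T:
        return 'abort'                          # a younger transaction already used X
    write_TS[X] = TS_T                          # execute write_item(X)
    return 'ok'

def to_read(TS_T, X):
    if write_TS.get(X, 0) > TS_T:
        return 'abort'                          # a younger transaction already wrote X
    read_TS[X] = max(read_TS.get(X, 0), TS_T)   # execute read_item(X)
    return 'ok'

print(to_write(10, 'X'))    # ok    (transaction with TS 10 writes X)
print(to_read(20, 'X'))     # ok    (younger transaction with TS 20 reads X)
print(to_write(10, 'X'))    # abort (X has since been read by a younger transaction)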

Strict Timestamp Ordering

1. Transaction T issues a write_item(X) operation:

a. If TS(T) > read_TS(X), then delay T until the transaction T’ that wrote or read X has
terminated (committed or aborted).

2. Transaction T issues a read_item(X) operation:

a. If TS(T) > write_TS(X), then delay T until the transaction T’ that wrote or read X has
terminated (committed or aborted).

Thomas’s Write Rule

1. If read_TS(X) > TS(T) then abort and roll-back T and reject the operation.
2. If write_TS(X) > TS(T), then just ignore the write operation and continue execution.
This is because the most recent writes counts in case of two consecutive writes.

3. If the conditions given in 1 and 2 above do not occur, then execute write_item(X) of T
and set write_TS(X) to TS(T).


4.2.3 Multiversion concurrency control Techniques

This approach maintains a number of versions of a data item and allocates the right
version to a read operation of a transaction. Thus unlike other mechanisms a read
operation in this mechanism is never rejected.

Side effect: Significantly more storage (RAM and disk) is required to maintain multiple
versions. To check unlimited growth of versions, a garbage collection is run when some
criteria is satisfied.

Multiversion technique based on timestamp ordering

Assume X1, X2, …, Xn are the versions of a data item X created by the write operations of transactions. With each version Xi a read_TS (read timestamp) and a write_TS (write timestamp) are associated.

read_TS(Xi): The read timestamp of Xi is the largest of all the timestamps of transactions that have successfully read version Xi.

write_TS(Xi): The write timestamp of Xi is the timestamp of the transaction that wrote the value of version Xi. A new version of X is created only by a write operation.

To ensure serializability, the following two rules are used.

1. If transaction T issues write_item(X) and version i of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back T; otherwise, create a new version Xj of X and set read_TS(Xj) = write_TS(Xj) = TS(T).

2. If transaction T issues read_item(X), find the version i of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), then return the value of Xi to T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).

Rule 2 guarantees that a read will never be rejected.

Multiversion Two-Phase Locking Using Certify Locks

Allow a transaction T’ to read a data item X while it is write-locked by a conflicting transaction T. This is accomplished by maintaining two versions of each data item X, where one version must always have been written by some committed transaction. This means a write operation always creates a new version of X.

Steps
1. X is the committed version of a data item.
2. T creates a second version X’ after obtaining a write lock on X.
3. Other transactions continue to read X.
4. T is ready to commit, so it obtains a certify lock on X’.
5. X’ becomes the committed version X.
6. T releases its certify lock on X’, which is X now.

Compatibility tables:

read/write locking scheme

         Read   Write
Read     yes    no
Write    no     no

read/write/certify locking scheme

          Read   Write   Certify
Read      yes    yes     no
Write     yes    no      no
Certify   no     no      no

Note

In multiversion 2PL, read and write operations from conflicting transactions can be processed concurrently. This improves concurrency, but it may delay transaction commit because of the need to obtain certify locks on all of the transaction's writes. It avoids cascading aborts, but, as in the strict two-phase locking scheme, conflicting transactions may become deadlocked.

4.2.4 Validation (Optimistic) Concurrency Control Techniques

In this technique, serializability is checked only at commit time, and transactions are aborted in case of non-serializable schedules.

Three phases:


Read phase: A transaction can read values of committed data items. However, updates
are applied only to local copies (versions) of the data items (in database cache).

Validation phase: Serializability is checked before transactions write their updates to the
database.

This phase for Ti checks that, for each transaction Tj that is either committed or is in its
validation phase, one of the following conditions holds:

1. Tj completes its write phase before Ti starts its read phase.

2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no
items in common with the write_set of Tj

3. Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj completes its read phase before Ti completes its read phase.

When validating Ti, the first condition is checked first for each transaction Tj, since (1) is the simplest condition to check.

If (1) is false then (2) is checked, and if (2) is false then (3) is checked. If none of these conditions holds, the validation fails and Ti is aborted.

Write phase: On successful validation, the transaction's updates are applied to the database; otherwise, the transaction is restarted.
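A minimal sketch of the validation test, assuming each transaction records the timestamps at which its phases start and end together with its read and write sets; all field names and values below are illustrative.

def validate(Ti, others):
    """Check Ti against every Tj that is committed or currently validating."""
    for Tj in others:
        if Tj['write_end'] < Ti['read_start']:                     # condition 1
            continue
        if (Tj['write_end'] < Ti['write_start']
                and not (Ti['read_set'] & Tj['write_set'])):       # condition 2
            continue
        if (not (Ti['read_set'] & Tj['write_set'])
                and not (Ti['write_set'] & Tj['write_set'])
                and Tj['read_end'] < Ti['read_end']):              # condition 3
            continue
        return False                                               # validation fails: abort Ti
    return True                                                    # Ti may enter its write phase

Ti = {'read_start': 10, 'read_end': 14, 'write_start': 15,
      'read_set': {'X'}, 'write_set': {'X'}}
Tj = {'read_end': 8, 'write_end': 9, 'read_set': {'Y'}, 'write_set': {'Y'}}
print(validate(Ti, [Tj]))   # True: Tj finished writing before Ti started reading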

4.2.5 Granularity of data items and Multiple Granularity Locking

A lockable unit of data defines its granularity. Granularity can be coarse (the entire database) or it can be fine (a single record or even a field). Data item granularity significantly affects concurrency control performance: the degree of concurrency is low for coarse granularity and high for fine granularity. Examples of data item granularity:


1. A field of a database record (an attribute of a tuple).
2. A database record (a tuple of a relation).
3. A disk block.
4. An entire file.
5. The entire database.

To manage such a hierarchy, in addition to read and write, three additional locking modes, called intention lock modes, are defined:

Intention-shared (IS): indicates that a shared lock (or locks) will be requested on some descendant node(s).

Intention-exclusive (IX): indicates that an exclusive lock (or locks) will be requested on some descendant node(s).

Shared-intention-exclusive (SIX): indicates that the current node is locked in shared mode but an exclusive lock (or locks) will be requested on some descendant node(s).

These locks are applied using the following compatibility matrix:
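A sketch of that matrix, encoded as a Python table over the IS, IX, S, SIX and X modes defined above (True means the requested lock is compatible with the lock already held):

COMPAT = {
    'IS':  {'IS': True,  'IX': True,  'S': True,  'SIX': True,  'X': False},
    'IX':  {'IS': True,  'IX': True,  'S': False, 'SIX': False, 'X': False},
    'S':   {'IS': True,  'IX': False, 'S': True,  'SIX': False, 'X': False},
    'SIX': {'IS': True,  'IX': False, 'S': False, 'SIX': False, 'X': False},
    'X':   {'IS': False, 'IX': False, 'S': False, 'SIX': False, 'X': False},
}

def can_grant(held_modes, requested):
    """A requested lock is granted only if it is compatible with every lock already held."""
    return all(COMPAT[held][requested] for held in held_modes)

print(can_grant(['IS'], 'S'))        # True: IS and S are compatible
print(can_grant(['IS', 'IX'], 'S'))  # False: S conflicts with the IX already held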

The set of rules which must be followed to produce serializable schedules is:

1. The lock compatibility matrix must be adhered to.

2. The root of the tree must be locked first, in any mode.

3. A node N can be locked by a transaction T in S or IS mode only if the parent node is already locked by T in either IS or IX mode.

4. A node N can be locked by T in X, IX, or SIX mode only if the parent of N is already locked by T in either IX or SIX mode.

5. T can lock a node only if it has not unlocked any node (to enforce the 2PL policy).

6. T can unlock a node N only if none of the children of N are currently locked by T.

Using Locks for concurrency control in Indexes

Real-time database systems are expected to rely heavily on indexes to speed up data
access and, thereby, help more transactions meet their deadlines. Accordingly, high-
performance index concurrency control (ICC) protocols are required to prevent
contention for the index from becoming a bottleneck. A new real-time ICC protocol
called GUARD-link augments the classical B-link protocol with a feedback-based
admission control mechanism and also supports both point and range queries, as well as
the undos of the index actions of aborted transactions. The performance metrics used in
evaluating the ICC protocols are the percentage of transactions that miss their deadlines
and the fairness with respect to transaction type and size.

The performance characteristics of the real-time version of an ICC protocol could be


significantly different from the performance of the same protocol in a conventional
(nonreal-time) database system. In particular, B-link protocols, which are reputed to
provide the best overall performance in conventional database systems, perform poorly
under heavy real-time loads. The new GUARD-link protocol, however, although based
on the B-link approach, delivers the best performance (with respect to all performance
metrics) for a variety of real-time transaction workloads, by virtue of its admission
control mechanism.

4.2.6 Database Recovery Techniques : Recovery Concepts

The Database can be updated immediately, but an update operation must be recorded in
the log before it is applied to the database.

In a single-user system, if a failure occurs, the operations of the interrupted transaction are undone.

When concurrent execution is permitted, the recovery process depends on the protocols used for concurrency control. For example, a strict two-phase locking protocol does not allow a transaction to read or write an item unless the transaction that last wrote the item has committed.

Database recovery refers to the process of restoring the database to a correct state in the event of a failure. The need for recovery control involves:
• Two types of storage: volatile (main memory) and nonvolatile.


• Volatile storage does not survive system crashes.


• Stable storage represents information that has been replicated in several nonvolatile
storage media with independent failure modes.

Failure types

Failures can be of several types: system crashes, resulting in loss of main memory; media failures, resulting in loss of parts of secondary storage; application software errors; natural physical disasters; carelessness or unintentional destruction of data or facilities; and sabotage.

A good DBMS should provide the following facilities to assist with recovery:


• Backup mechanism, which makes periodic backup copies of database.
• Logging facilities, which keep track of current state of transactions and database
changes.
• Checkpoint facility, which enables updates to database in progress to be made
permanent.
• Recovery manager, which allows DBMS to restore the database to a consistent state
following a failure.

A Log file contains information about all updates to database:


• Transaction records.
• Checkpoint records.
Transaction records contain:
• Transaction identifier.
• Type of log record, (transaction start, insert, update, delete, abort, commit).
• Identifier of data item affected by database action (insert, delete, and update
operations).
• Before-image of data item.
• After-image of data item.
• Log management information.

A checkpoint is a point of synchronization between the database and the log file. All buffers are force-written to secondary storage, and a checkpoint record is created containing the identifiers of all active transactions. When a failure occurs, redo all transactions that committed since the checkpoint and undo all transactions active at the time of the crash.
If the database has been damaged, the last backup copy of the database must be restored and the updates of committed transactions reapplied using the log file. If the database is only inconsistent, the changes that caused the inconsistency must be undone; some transactions may also need to be redone to ensure their updates reach secondary storage. This case does not need the backup, because the database can be restored using the before- and after-images in the log file.

Main Recovery Techniques


Three main recovery techniques:
• Deferred Update


• Immediate Update
• Shadow Paging.

Deferred Updates
• Updates are not written to the database until after a transaction has reached its commit
point.
• If transaction fails before commit, it will not have modified database and so no undoing
of changes required.
• May be necessary to redo updates of committed transactions as their effect may not
have reached database.

Immediate Updates
• Updates are applied to database as they occur.
• Need to redo updates of committed transactions following a failure.
• May need to undo effects of transactions that had not committed at time of failure.
• Essential that log records are written before the write to the database; this is called the write-ahead
log protocol.
• If no "transaction commit" record in log, then that transaction was active at failure and
must be undone.
• Undo operations are performed in reverse order in which they were written to log.

Shadow Paging

• Maintain two page tables during the life of a transaction: the current page table and the shadow page
table.
• When the transaction starts, the two page tables are identical.
• Shadow page table is never changed thereafter and is used to restore database in event
of failure.
• During transaction, current page table records all updates to database.
• When transaction completes, current page table becomes shadow page table.

This recovery scheme does not require the use of a log in a single-user environment. In a
multiuser environment, a log may be needed for the concurrency control method.

When a transaction begins executing, the current directory, whose entries point to the
most recent or current database pages on disk, is copied into a shadow directory. The
shadow directory is then saved on disk while the current directory is used by the
transaction.

When a write item operation is performed, a new copy of the modified database page is
created.

To recover from a failure during transaction execution, it is sufficient to free the
modified database pages and to discard the current directory.


4.2.7 The ARIES Recovery Algorithm

The ARIES Recovery Algorithm is based on:

1. WAL (Write Ahead Logging)

2. Repeating history during redo: ARIES will retrace all actions of the database system
prior to the crash to reconstruct the database state when the crash occurred.

3. Logging changes during undo: It will prevent ARIES from repeating the completed
undo operations if a failure occurs during recovery, which causes a restart of the recovery
process.

The ARIES recovery algorithm consists of three steps:

1. Analysis: this step identifies the dirty (updated) pages in the buffer and the set of
transactions active at the time of crash. The appropriate point in the log where redo is to
start is also determined.

2. Redo: necessary redo operations are applied.


3. Undo: log is scanned backwards and the operations of transactions active at the time of
crash are undone in reverse order.

The Log and Log Sequence Number (LSN)

A log record stores:

1. Previous LSN of that transaction: It links the log record of each transaction. It is like a
back pointer points to the previous record of the same transaction.
2. Transaction ID


3. Type of log record.

For a write operation the following additional information is logged:

4. Page ID for the page that includes the item

5. Length of the updated item

6. Its offset from the beginning of the page

7. BFIM of the item

8. AFIM of the item

The Transaction table and the Dirty Page table

For efficient recovery following tables are also stored in the log during checkpointing:

Transaction table: Contains an entry for each active transaction, with information such
as transaction ID, transaction status and the LSN of the most recent log record for the
transaction.

Dirty Page table: Contains an entry for each dirty page in the buffer, which includes the
page ID and the LSN corresponding to the earliest update to that page.

Checkpointing

A checkpointing does the following:

1. Writes a begin_checkpoint record in the log

2. Writes an end_checkpoint record in the log. With this record the contents of transaction
table and dirty page table are appended to the end of the log.

3. Writes the LSN of the begin_checkpoint record to a special file. This special file is
accessed during recovery to locate the last checkpoint information.
To reduce the cost of checkpointing and allow the system to continue to execute
transactions, ARIES uses “fuzzy checkpointing”.

The following steps are performed for recovery

1. Analysis phase: Start at the begin_checkpoint record and proceed to the
end_checkpoint record. Access the transaction table and dirty page table that were appended to the
end of the log. Note that during this phase some other log records may be written to the
log and the transaction table may be modified. The analysis phase compiles the set of redo
and undo operations to be performed and ends.

2. Redo phase: Starts from the point in the log up to which all dirty pages are known to
have been flushed, and moves forward to the end of the log. Any change that appears in the dirty
page table is redone.

3. Undo phase: Starts from the end of the log and proceeds backward while performing
appropriate undo. For each undo it writes a compensating record in the log.
The recovery completes at the end of the undo phase.

An example of the working of ARIES scheme

4.2.8 Recovery In Multi Database System

A multidatabase system is a special distributed database system where one node may be
running a relational database system under Unix, another may be running an object-oriented
system under Windows, and so on. A transaction may run in a distributed fashion at
multiple nodes. In this execution scenario the transaction commits only when all these
multiple nodes agree to commit individually the part of the transaction they were
executing. This commit scheme is referred to as “two-phase commit” (2PC). If any
one of these nodes fails or cannot commit the part of the transaction, then the transaction
is aborted. Each node recovers the transaction under its own recovery protocol.


In some cases a single transaction (called a multidatabase transaction) may require access
to multiple databases.

To maintain the atomicity of a multidatabase transaction, it is necessary to have a two-level
recovery mechanism: a global recovery manager, or coordinator, is needed in addition to the local
recovery managers.

Phase 1: When all participating databases signal the coordinator that the part of the
multidatabase transaction involving them has concluded, the coordinator sends the message
"prepare for commit"; each participating database force-writes its log records and replies
"OK" (or "not OK") according to the result of the force-write.

Phase 2: If all participating databases reply "OK", the transaction is successful and the
coordinator sends a "commit" signal to the participating databases.
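
As a brief, hedged illustration, the statements below sketch what the coordinator's dialogue with one participant could look like in a DBMS that exposes two-phase commit at the SQL level (the PREPARE TRANSACTION / COMMIT PREPARED syntax shown is PostgreSQL-style; the table name and global transaction identifier are hypothetical):

BEGIN;
UPDATE ACCOUNT SET BALANCE = BALANCE - 500 WHERE ACCNO = 'A-101';

-- Phase 1: "prepare for commit" - the participant force-writes its log records
-- and makes the local effects durable without committing them, then reports OK.
PREPARE TRANSACTION 'global_txn_42';

-- Phase 2: issued by the coordinator only after ALL participants replied OK.
COMMIT PREPARED 'global_txn_42';

-- Had any participant failed to prepare, the coordinator would instead issue:
-- ROLLBACK PREPARED 'global_txn_42';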

Database Backup and Recovery from Catastrophic Failures

A key assumption has been that the system log is maintained on the disk and is not lost
as a result of the failure.

The recovery manager of a DBMS must also be equipped to handle more catastrophic
failures such as disk crashes.

The main technique used to handle such cases is that of database backup. The whole
database and the log are periodically copied onto a cheap storage medium such as
magnetic tapes.

Database Security and Authorization: Database Security Issues

a. Security Issues - Access Controls

The most common form of access control in a relational database is the view (for a
detailed discussion of relational databases, see [RobCor93]). The view is a logical table,
which is created with the SQL VIEW command.

This table contains data from the database obtained by additional SQL commands such as
JOIN and SELECT. If the database is unclassified, the source for the view is the entire
database. If, on the other hand, the database is subject to multilevel classification, then
the source for the view is that subset of the database that is at or below the classification
level of the user. Users can read or modify data in their view, but the view prohibits users
from accessing data at a classification level above their own. In fact, if the view is
properly designed, a user at a lower classification level will be unaware that data exists at
a higher classification level [Denn87a].
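
As a small sketch (the EMPLOYEE table, its CLASSIFICATION column, and the CLERK account are hypothetical), a view of this kind restricts a group of users to rows at or below their clearance and to non-sensitive columns:

CREATE VIEW EMP_UNCLASSIFIED AS
SELECT NAME, DEPT, JOB_TITLE
FROM EMPLOYEE
WHERE CLASSIFICATION = 'UNCLASSIFIED';

GRANT SELECT ON EMP_UNCLASSIFIED TO CLERK;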

In order to define what data can be included in a view source, all data in the database
must receive an access classification. Denning [Denn87a] lists several potential access
classes that can be applied.


These include:

(1) Type dependent: Classification is determined based on the attribute associated with
the data.

(2) Value dependent: Classification is determined based on the value of the data.

(3) Source level: Classification of the new data is set equivalent to the classification of
the data source.

(4) Source label: The data is arbitrarily given a classification by the source or by the user
who enters the data.

Classification of data and development of legal views become much more complex when
the security goal includes the reduction of the threat of inference attacks. Inference is
typically made from data at a lower classification level that has been derived from higher
level data. The key to this relationship is the derivation rule, which is defined as the
operation that creates the derived data (for example, a mathematical equation). A
derivation rule also specifies the access class of the derived data. To reduce the potential
for inference, however, the data elements that are inputs to the derivation must be
examined to determine whether one or more of these elements are at the level of the
derived data. If this is the case, no inference problem exists. If, however, all the elements
are at a lower level than the derived data, then one or more of the derivation inputs must
be promoted to a higher classification level [Denn87a].

The use of classification constraints to counter inference, beyond the protections provided
by the view, requires additional computation. Thuraisingham and Ford [ThurFord95]
discuss one way that constraint processing can be implemented. In their model,
constraints are processed in three phases. Some constraints are processed during design
(these may be updated later), others are processed when the database is queried to
authorize access and counter inference, and many are processed during the update phase.
Their strategy relies on two inference engines, one for query processing and one for
update processing. Essentially, the inference engines are middlemen, which operate
between the DBMS and the interface (see figure 1). According to Thuraisingham and
Ford, the key to this strategy is the belief that most inferential attacks will occur as a
result of summarizing a series of queries (for example, a statistical inference could be
made by using a string of queries as a sample) or by interpreting the state change of
certain variables after an update.

The two inference engines work by evaluating the current task according to a set of rules
and determining a course of action. The inference engine for updates dynamically revises
the security constraints of the database as the security conditions of the organization
change and as the security characteristics of the data stored in the database change. The
inference engine for query processing evaluates each entity requested in the query, all the
data released in a specific period that is at the security level of the current query, and
relevant data available externally at the same security level. This is called the knowledge


base. The processor evaluates the potential inferences from the union of the knowledge
base and the query’s potential response. If the user’s security level dominates the security
levels of all of the potential inferences, the response is allowed [ThurFord95].

b. Security Issues -Integrity

The integrity constraints in the relational model can be divided into two categories:

(1) implicit constraints and (2) explicit constraints.

Implicit constraints which include domain, relational, and referential constraints enforce
the rules of the relational model.

Explicit constraints enforce the rules of the organization served by the DBMS. As such,
explicit constraints are one of the two key elements (along with views) of security
protection in the relational model [BellGris92].

Accidental or deliberate modification of data can be detected by explicit constraints.


Pfleeger [Pflee89] lists several error detection methods, such as parity checks, that can be
enforced by explicit constraints. Earlier we discussed local integrity constraints (section
2.2.). These constraints are also examples of explicit constraints.

Multilevel classification constraints are another example. A final type of explicit


constraint enforces polyinstantiation integrity.

Polyinstantiation refers to the replication of a tuple in a multilevel access system. This


occurs when a user at a lower level L2 enters a tuple into the database which has the same
key as a tuple which is classified at a higher level L1 (L1 > L2). The DBMS has two
options. It can refuse the entry, which implies to the lower-level user that a tuple with the same key exists at L1,
or it can allow the entry. If it allows the entry, then two tuples with identical keys exist in
the database. This condition is called polyinstantiation [Haig91]. Obvious integrity
problems can result. The literature contains several algorithms for ensuring
polyinstantiation integrity.

Typically, explicit constraints are implemented using the SQL ASSERT or TRIGGER
commands. ASSERT statements are used to prevent an integrity violation. Therefore,
they are applied before an update. The TRIGGER is part of a response activation
mechanism. If a problem with the existing database is detected (for example, an error is
detected after a parity check), then a predefined action is initiated [BellGris92]. More
complicated explicit constraints like multilevel classification constraints require
additional programming with a 3GL. This is the motivation for the constraint processor.
So, SQL and, consequently, the relational model alone cannot protect the database from
determined inferential attacks.
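
For illustration only (the ACCOUNT and VIOLATION_LOG tables are hypothetical, and this is a sketch rather than the constraint processor described above), an explicit constraint of the TRIGGER kind might record any update that leaves a balance negative:

CREATE TRIGGER BALANCE_CHECK
AFTER UPDATE OF BALANCE ON ACCOUNT
REFERENCING NEW ROW AS N
FOR EACH ROW
WHEN (N.BALANCE < 0)
INSERT INTO VIOLATION_LOG VALUES (N.ACCNO, CURRENT_TIMESTAMP);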


4.2.9 Object-oriented Database Security

Object-oriented Databases
While it is not the intent of this paper to present a detailed description of the object-
oriented model, the reader may be unfamiliar with the elements of an object-oriented
database. For this reason, we will take a brief look at the object-oriented model's basic
structure. For a more detailed discussion, the interested reader should see [Bert92,
Stein94, or Sud95].

The basic element of an object-oriented database is the object. An object is defined by a


class. In essence, classes are the blueprints for objects. In the object-oriented model,
classes are arranged in a hierarchy. The root class is found at the top of the hierarchy.
This is the parent class for all other classes in the model. We say that a class that is the
descendent from a parent inherits the properties of the parent class. As needed, these
properties can be modified and extended in the descendent class [MilLun92].

An object is composed of two basic elements:

variables and methods.

An object holds three basic variables types:

(1) Object class: This variable keeps a record of the parent class that defines the object.

(2) Object ID (OID):


A record of the specific object instance. The OID is also kept in an OID table. The OID
table provides a map for finding and accessing data in the object-oriented database. As
we will see, this also has special significance in creating a secure database.

(3) Data stores: These variables store data in much the same way that attributes store data
in a relational tuple [MilLun92].

Methods are the actions that can be performed by the object and the actions that can be
performed on the data stored in the object variables. Methods perform two basic
functions: They communicate with other objects and they perform reads and updates on
the data in the object. Methods communicate with other objects by sending messages.
When a message is sent to an object, the receiving object creates a subject. Subjects
execute methods; objects do not. If the subject has suitable clearance, the message will
cause the subject to execute a method in the receiving object. Often, when the action at
the called object ends, the subject will execute a method that sends a message to the
calling object indicating that the action has ended [MilLun92].

Methods perform all reading and writing of the data in an object. For this reason, we say
that the data is encapsulated in the object. This is one of the important differences
between object-oriented and relational databases [MilLun92]. All control for access,


modification, and integrity start at the object level. For example, if no method exists for
updating a particular object's variable, then the value of that variable is constant. Any
change in this condition must be made at the object level.

Access Controls

As with the relational model, access is controlled by classifying elements of the database.
The basic element of this classification is the object. Access permission is granted if the
user has sufficient security clearance to access the methods of an object. Millen and Lunt
[MilLun92] describe a security model that effectively explains the access control
concepts in the object-oriented model. Their model is based on six security properties:

Property 1 (Hierarchy Property). The level of an object must dominate that of its class
object.

Property 2 (Subject Level Property). The security level of a subject dominates the level
of the invoking subject and it also dominates the level of the home object.

Property 3 (Object Locality Property). A subject can execute methods or read or write
variables only in its home object.

Property 4 (*-Property) A subject may write into its home object only if its security is
equal to that of the object.

Property 5 (Return value property) A subject can send a return value to its invoking
subject only if it is at the same security level as the invoking subject.

Property 6 (Object creation property) The security level of a newly-created object


dominates the level of the subject that requested the creation [MilLun92].

Property 1 ensures that the object that inherits properties from its parent class has at least
the same classification level as the parent class. If this were not enforced, then users
could gain access to methods and data for which they do not have sufficient clearance.

Property 2 ensures that the subject created by the receiving object has sufficient clearance
to execute any action from that object. Hence, the classification level given to the subject
must be equal to at least the highest level of the entities involved in the action.

Property 3 enforces encapsulation. If a subject wants to access data in another object, a


message must be sent to that object where a new subject will be created. Property 6 states
that new objects must have at least as high a clearance level as the subject that creates the
object. This property prevents the creation of a covert channel.

Properties 4 and 5 are the key access controls in the model.


Property 4 states that the subject must have sufficient clearance to update data in its home
object. If the invoking subject does not have as high a classification as the called object's
subject, an update is prohibited.

Property 5 ensures that if the invoking subject from the calling object does not have
sufficient clearance, the subject in the called object will not return a value.

The object-oriented model and the relational model minimize the potential for inference
in a similar manner. Remaining consistent with encapsulation, the classification
constraints are executed as methods. If a potential inference problem exists, access to a
particular object is prohibited [MilLun92].

Integrity

As with classification constraints, integrity constraints are also executed at the object
level [MilLun92]. These constraints are similar to the explicit constraints used in the
relational model. The difference is in execution. An object-oriented database maintains
integrity before and after an update by executing constraint checking methods on the
affected objects. As we saw in section 4.1.2., a relational DBMS takes a more global
approach.

One of the benefits of encapsulation is that subjects from remote objects do not have
access to a called object's data. This is a real advantage that is not present in the
relational DBMS. Herbert [Her94] notes that an object oriented system derives a
significant benefit to database integrity from encapsulation. This benefit stems from
modularity. Since the objects are encapsulated, an object can be changed without
affecting the data in another object. So, the process that contaminated one element is less
likely to affect another element of the database.

4.2.10 Object-Oriented Database Security Problems in the Distributed Environment

Sudama [Sud95] states that there are many impediments to the successful implementation
of a distributed object-oriented database. The organization of the object-oriented
DDBMS is more difficult than the relational DDBMS. In a relational DDBMS, the role of
client and server is maintained. This makes the development of multilevel access controls
easier. Since the roles of client and server are not well defined in the object-oriented
model, control of system access and multilevel access is more difficult.

System access control for the object-oriented DDBMS can be handled at the host site in a
procedure similar to that described for the relational DDBMS. Since there is no clear
definition of client and server, however, the use of replicated multisite approval would be
impractical.

Multilevel access control problems arise when developing effective and efficient
authorization algorithms for subjects that need to send messages to multiple objects
across several geographically separate locations. According to Sudama [Sud95], there are


currently no universally accepted means for enforcing subject authorization in a pure


object-oriented distributed environment. This means that, while individual members have
developed their own authorization systems, there is no pure object-oriented vendor-
independent standard which allows object-oriented database management systems
(OODBMS) from different vendors (a heterogeneous distributed system) to communicate
in a secure manner. Without subject authorization, the controls described in the previous
section cannot be enforced. Since inheritance allows one object to inherit the properties
of its parent, the database is easily compromised. So, without effective standards, there is
no way to enforce multilevel classification.

Sudama [Sud95] notes that one standard does exist, called OSF DCE (Open Software
Foundation's Distributed Computing Environment), that is vendor-independent, but is
not strictly an object-oriented database standard.

While it does provide subject authorization, it treats the distributed object environment as
a client/server environment as is done in the relational model. He points out that this
problem may be corrected in the next release of the standard.

The major integrity concern in a distributed environment that is not a concern in the
centralized database is the distribution of individual objects. Recall that a RDBMS allows
the fragmentation of tables across sites in the system. It is less desirable to allow the
fragmentation of objects because this can violate encapsulation. For this reason,
fragmentation should be explicitly prohibited with an integrity constraint [Her94].

The DBA has a DBA account in the DBMS, which provides powerful
capabilities that are not made available to regular database accounts and users.

DBA account can be used to perform the following types of actions :

1. Account creation - creates a new account and password for a user or a group of
users.

2. Privilege granting – permits the DBA to grant certain privileges to certain accounts.

3. Privilege revocation – permits the DBA to revoke certain privileges that were
previously given to certain accounts.

4. Security level assignment- assigning user accounts to the appropriate security


classification level.

The DBA is fully responsible for the overall security of the system.


Discretionary Access Control Based on Granting / Revoking of Privileges

The typical method is based on the granting and revoking of privileges.


Two levels for assigning privileges to use the database system :

1. The account level- the DBA specifies the particular privileges that each
account holds independently of the relations in the database
( Create TABLE, Create VIEW, Drop privilege)

2. The relation level – control the privilege to access each individual relation or
view in the database (Generally known as the access matrix model, where the
rows are subjects – users, account, programs – and the columns are objects –
relations, records, columns, views, operations)

In SQL the following types of privileges can be granted on each individual


relation R:

- Select – gives the account the retrieval privilege on R.
- Modify – gives the account the capability to modify tuples of R.
- References – gives the account the capability to reference relation R
when specifying integrity constraints.

The view mechanism is an important discretionary authorization mechanism in its own right.

Example:

DBA can issue

GRANT CREATETAB TO ACC1;


CREATE SCHEMA COMPANY AUTHORIZATION ACC1;

Next Acc1 can issue

GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO ACC2;

Next Acc1 can issue


GRANT SELECT ON EMPLOYEE, DEPARTMENT TO ACC3
WITH GRANT OPTION;

Now ACC3 can issue

GRANT SELECT ON EMPLOYEE TO ACC4;

Now ACC1 can issue


REVOKE SELECT ON EMPLOYEE FROM ACC3;


ACC1 also can issue

CREATE VIEW EMPVIEW AS
SELECT NAME, BDATE, ADDRESS
FROM EMPLOYEE WHERE DNO=20;

GRANT SELECT ON EMPVIEW TO ACC3 WITH GRANT OPTION;

Finally ACC1 can issue

GRANT UPDATE (SALARY) ON EMPLOYEE TO ACC4;

Mandatory Access Control For Multilevel Security

The discretionary access control technique of granting and revoking privileges is an
all-or-nothing method.

The need for multilevel security exists in government, industry, and corporate
applications.

Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified
(U), where TS > S > C > U.

Two restrictions are enforced on data access based on the subject/object (S/O) classifications:

1. A subject S is not allowed read access to an object O unless class(S) >= class(O).

2. A subject S is not allowed to write an object O unless class(S) <= class(O).

Statistical Database Security

Statistical databases are used mainly to produce statistics on various populations.

A population is a set of tuples of a relation that satisfy some selection condition.


Statistical database security techniques must prevent the retrieval of individual data. In
some cases it may be possible to infer the values of individual tuples from a sequence of
statistical queries.

The possibility of inferring individual information from statistical queries is reduced if no
statistical queries are permitted whenever the number of tuples in the population specified
by the selection condition falls below some threshold.
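
As a sketch (using a hypothetical EMPLOYEE table), a statistical database would answer an aggregate query such as the following only when the selected population is larger than the threshold:

SELECT COUNT(*), AVG(SALARY)
FROM EMPLOYEE
WHERE DNO = 5 AND GENDER = 'F';

-- If this condition selected only one tuple, the query would effectively reveal
-- that individual's salary, so such a narrow query should be rejected.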


4.3 Revision Points


o Concurrency control effects

- To enforce isolation (through mutual exclusion) among conflicting transactions.
- To preserve database consistency through consistency-preserving execution of transactions.
- To resolve read-write and write-write conflicts.

o Two-Phase Locking Techniques

Two locking modes are available: a) shared and b) exclusive.

o Deadlock situation

It is a situation in which two transactions each wait for the other to release the lock
on an item they need, so neither can proceed.

o Starvation

This occurs when a specific transaction waits or is repeatedly restarted and never gets a chance
to proceed further.

o Time Stamp

A timestamp indicates the age of an operation or a transaction; the larger timestamp
value indicates that it is the more recent operation.

o Granularity of DATA items

A lockable unit of data defines its granularity. Granularity can range from an entire
database down to a single record. The chosen granularity affects concurrency control
performance.

o Shadow Paging

When a transaction begins executing, the current directory, whose entries point to
the most recent or current database pages on disk, is copied into a directory
known as shadow directory.

In order to recover from a failure during transaction execution, it is sufficient to
free the modified database pages and to discard the current directory.


o The account level- the DBA specifies the particular privileges that each
account holds independently of the relations in the database
( Create TABLE, Create VIEW, Drop privilege)

o The relation level – control the privilege to access each individual relation or
view in the database (Generally known as the access matrix model, where the
rows are subjects – users, account, programs – and the columns are objects –
relations, records, columns, views, operations)

4.4 Intext questions


1. By giving an example, illustrate the ARIES algorithm.
2. What is a timestamp?
3. Write a note on concurrency control techniques.
4. Give a brief account of the deadlock situation.
5. What are the validation concurrency control techniques?
6. Discuss the security issues in object-oriented databases.

4.5 Summary
Concurrency control helps in isolation among conflicting transactions that takes part in
database management.

In multiversion 2PL read and write operations from conflicting transactions can be
processed concurrently. This improves concurrency but it may delay transaction commit
because of obtaining certify locks on all its writes. It avoids cascading abort but like strict
two phase locking scheme conflicting transactions may get deadlocked.

The Degree of concurrency is low for coarse granularity and high for fine granularity.
When concurrent execution is permitted, the recovery process depends on the protocols
used for concurrency control.

Transaction table: Contains an entry for each active transaction, with information such
as transaction ID, transaction status and the LSN of the most recent log record for the
transaction.
Dirty Page table: Contains an entry for each dirty page in the buffer, which includes the
page ID and the LSN corresponding to the earliest update to that page.

A multidatabase system is a special distributed database system where one node may be
running relational database system under Unix, another may be running object-oriented
system under window and so on. A transaction may run in a distributed fashion at
multiple nodes. In this execution scenario the transaction commits only when all these
multiple nodes agree to commit individually the part of the transaction they were
executing.


This commit scheme is referred to as “two-phase commit” (2PC). If any


one of these nodes fails or cannot commit the part of the transaction, then the transaction
is aborted. Each node recovers the transaction under its own recovery protocol.

The discretionary access control technique of granting and revoking privileges is an all –
or-nothing method.

The recovery manager of a DBMS must also be equipped to handle more catastrophic
failures such as disk crashes.

Statistical database security techniques must prevent the retrieval of individual data.
In some cases it may be possible to infer the values of individual tuples from a sequence
of statistical queries.

4.6. Terminal Questions


1. ______________ helps in isolation among conflicting transactions that takes part
in database management.
2. What is the purpose of concurrency control?
3. ______ table Contains an entry for each dirty page in the buffer, which includes
the page ID and the LSN corresponding to the earliest update to that page.
4. List the main recovery techniques.
5. What is time Stamp?
6. What is ARIES Algorithm based on?

4.7 Supplementary Materials

[BellGris92] Bell, David and Jane Grimson, Distributed Database Systems. Wokingham,
England: Addison-Wesley, 1992.

[Bert92] Bertino, Elisa, “Data Hiding and Security in Object-Oriented Databases,” In


proceedings Eighth International Conference on Data Engineering, 338-347, February
1992.

4.8 Assignment
Prepare an assignment on object-oriented database security.

4.9 Reference Books


Elmasri, R. & Navathe, S. B. (2000). Fundamentals of Database Systems. (3rd ed.).

[Denn87a] Denning, Dorothy E. et al., “Views for Multilevel Database Security,” In


IEEE Transactions on Software Engineering, vSE-13 n2, pp. 129-139, February 1987.


[Her94] Herbert, Andrew, “Distributing Objects,” In Distributed Open Systems, F.M.T.


Brazier and D. Johansen eds., pp. 123-132, Los Alamitos: IEEE Computer Press, 1994.

[Inf96] “Illustra Object Relational Database Management System,” Informix white paper
from the Illustra Document Database, 1996.

[JajSan90] Jajodia, Sushil and Ravi Sandhu, “Polyinstantiation Integrity in Multilevel


Relations,” In Proceedings IEEE Symposium on Research in Security and Privacy, pp.
104-115, 1990.

4.10 Learning Activities

Individuals or groups of learners may visit the library for further reading on the topics covered in this unit.

4.11 Keywords
1. Concurrency control
2. Time Stamp
3. Shadow Paging
4. Immediate Update
5. Deferred update
6. Dirty Page Table
7. Deadlock


UNIT – V
Topics:
 Enhanced Data Models for Advanced Applications
 Temporal Database Concepts
 Spatial and Multimedia Database
 Distributed Databases and Client – Server Architecture
 Data Fragmentation, Replication and Allocation Techniques
 Types of Distributed Database Systems
 Query Processing in Distributed Databases
 Overview of Concurrency Control and Recovery in Distributed Databases
 Client- Server Architecture and its Relationship to Distributed Databases
 Distributed Databases in Oracle
 Deductive Databases
 Prolog/Datalog Notation-Interpretation of Rules
 Basic Interface Mechanisms for Logic Programs

5.0 Introduction
Enhanced data models for advanced applications are extensions of the data models we have
already come across in database architecture. These advanced applications incorporate
spatial and multimedia databases, which are widely used in modern information technology.
Temporal databases, on the other hand, deal with time and calendar-related events, while
spatial databases deal with geographical information systems, weather data, maps, and so on.

5.1 Objective
The objective of this lesson is to learn and understand enhanced data models, including
active databases and triggers, the concepts of distributed database management systems and
their security concerns, and the problem areas of that security. The lesson also introduces
Prolog/Datalog notation and deductive databases.
5.2 Contents
5.2.1 Enhanced Data Models for Advanced Applications

• Active database & triggers





Active database & triggers

Triggers are executed when a specified condition occurs during an insert, delete, or update.
Triggers are actions that fire automatically based on these conditions. Triggers follow an
Event-Condition-Action (ECA) model:

Event: a database modification operation (insert, delete, update) that causes the trigger
to be considered.

Condition: an optional condition that is checked when the event occurs; the action is
executed only if the condition is true.

Action: the operation, typically one or more SQL statements, to be carried out automatically.

Example: when a new employee is added to a department, modify the Total_sal of the
department to include the new employee's salary. Logically this means that we will
CREATE a TRIGGER; let us call the trigger Total_sal1.
Example: Trigger Definition

CREATE TRIGGER Total_sal1
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
WHEN (NEW.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal + NEW.Salary
WHERE Dno = NEW.Dno;


In the trigger definition above, the trigger timing keyword can be FOR, AFTER, or INSTEAD OF;
the triggering event can be INSERT, UPDATE, or DELETE; and the defining statement can be
CREATE or ALTER.

CREATE or ALTER TRIGGER

• CREATE TRIGGER <name> creates a trigger.
• ALTER TRIGGER <name> alters a trigger (assuming one exists).
• CREATE OR ALTER TRIGGER <name> works whether the trigger already exists or not: it creates
the trigger if necessary, otherwise it alters the existing one.

INSTEAD OF triggers execute instead of the triggering event; note that the event itself does
not execute in this case. They are chiefly useful for modifying views.

Row-Level versus Statement-Level

• FOR EACH ROW specifies a row-level trigger, which fires once for each affected row.
• Statement-level triggers fire once per triggering statement, regardless of the number of
affected rows.

Condition

• Any true/false condition can be used to control whether a trigger is activated or not.
• For an AFTER trigger, the condition is evaluated after the triggering event.

Action

• The action is the operation to be carried out, typically one or more SQL data-modification
statements.

a. Triggers on Views

INSTEAD OF triggers are used to process view modifications.
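
As a sketch (the view EMP_DEPT_VIEW and its columns are hypothetical), an INSTEAD OF trigger can redirect an insert on a view to the underlying base table:

CREATE TRIGGER EMP_VIEW_INSERT
INSTEAD OF INSERT ON EMP_DEPT_VIEW
REFERENCING NEW ROW AS N
FOR EACH ROW
INSERT INTO EMPLOYEE (Name, ENo, Dno)
VALUES (N.Name, N.ENo, N.Dno);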


b. Active Database Concepts and Triggers

An active database allows users to make the following changes to triggers (rules):
• Activate
• Deactivate
• Drop

Once its event occurs, the condition of a rule can be considered in 3 ways:
• Immediate consideration
• Deferred consideration
• Detached consideration

Immediate consideration: the condition is evaluated as part of the same transaction as the
triggering event, either before, after, or instead of the event, depending on the situation.

Deferred consideration: the condition is evaluated at the end of the transaction.

Detached consideration: the condition is evaluated in a separate transaction from the
triggering one.

Potential applications for active databases include automatic notification when a certain
condition occurs, enforcement of integrity constraints, and maintenance of derived data.
c. Triggers in SQL-99

Variables can be aliased inside the REFERENCING clause. An example trigger:

CREATE TRIGGER Total_sal
AFTER UPDATE OF Salary ON EMPLOYEE
REFERENCING OLD ROW AS O, NEW ROW AS N
FOR EACH ROW
WHEN (N.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal + N.Salary - O.Salary
WHERE Dno = N.Dno;

Temporal Database Concepts


Temporal databases are concerned with time representation, calendars, and time dimensions.
Time is modeled as an ordered sequence of points in some granularity, and a calendar
organizes time into different time units for convenience. Time representation involves two
dimensions: valid time, the time during which a fact is true in the real world, and
transaction time, the time when the information from a certain transaction becomes valid
(is recorded) in the database. A bitemporal database deals with both time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning

a) Valid time relations, formed by adding valid start time (VST) and valid end time (VET)
attributes:

EMP_VT (Name, ENo, Salary, Dno, Supervisor_name, VST, VET)
DEPT_VT (DName, DNo, Total_sal, Manager_name, VST, VET)

b) Transaction time relations, formed by adding transaction start time (TST) and transaction
end time (TET) attributes:

EMP_TT (Name, ENo, Salary, Dno, Supervisor_name, TST, TET)
DEPT_TT (DName, DNo, Total_sal, Manager_name, TST, TET)
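
Using the EMP_VT valid time relation above, temporal queries are ordinary SQL over the added time attributes; as a sketch (assuming a DATE granularity and a sample employee name):

-- Salary history (all versions) of one employee.
SELECT Name, Salary, VST, VET
FROM EMP_VT
WHERE Name = 'John Smith';

-- Employee versions that were valid on 1 January 2002.
SELECT Name, Salary
FROM EMP_VT
WHERE VST <= DATE '2002-01-01' AND VET >= DATE '2002-01-01';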

Incorporating Time in Object-Oriented Databases Using Attribute Versioning

Attribute versioning uses a single complex object to store all the temporal changes of the
object. A time-varying attribute is one whose values change over time, whereas a
non-time-varying attribute is one whose value does not change over time (it is fixed).
Spatial and Multimedia Databases

a. Spatial Database Concepts


Spatial databases keep track of objects in a multi-dimensional space. Maps, for example,
contain two-dimensional descriptions of the objects they represent; such applications are
often called Geographical Information Systems (GIS). Here we consider two-dimensional
spatial databases.

b. Typical Spatial Queries

Range query: finds objects of a particular type within a particular distance from a given
location.

Nearest neighbor query: finds the object of a particular type that is nearest to a given
location, e.g., the object of that type closest to an address in Pleasanton, CA.

Spatial joins or overlays: joins objects of two types based on some spatial condition
(intersecting, overlapping, within a certain distance, etc.), e.g., all objects of a given
type that lie within a certain distance of Interstate I-680.
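
As a rough sketch (a hypothetical RESTAURANT table storing plain X/Y coordinates), a range query can be approximated in ordinary SQL with a bounding box; dedicated spatial extensions add true distance predicates and spatial indexes such as R-trees:

-- Objects within (roughly) 5 units of the point (100, 200).
SELECT Name, X, Y
FROM RESTAURANT
WHERE X BETWEEN 100 - 5 AND 100 + 5
  AND Y BETWEEN 200 - 5 AND 200 + 5;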

c. R-trees

R-trees are an index technique for typical spatial queries. They group objects that are close
in spatial proximity on the same leaf nodes of a tree-structured index, and each internal node
defines a bounding rectangle that covers all the areas of the rectangles in its subtree.

d. Quad trees

Quad trees recursively divide a subspace into equally sized areas.

In the years ahead multimedia information systems are expected to dominate our daily lives.
Our houses will be wired for high bandwidth, and high-definition TV/computer workstations
will have access to a large number of databases, including digital libraries and image and
video databases that will distribute vast amounts of multisource multimedia content.

e. Multimedia Databases

Types of multimedia data are available in current systems

Text: May be formatted or unformatted. For ease of parsing structured documents,


standards like SGML and variations such as HTML are being used.
Graphics: Examples include drawings and illustrations that are encoded using some
descriptive standards (e.g. CGM, PICT, postscript).

Images: Includes drawings, photographs, and so forth, encoded in standard formats such
as bitmap, JPEG, and MPEG. Compression is built into JPEG and MPEG.

These images are not subdivided into components.


Hence querying them by content (e.g., find all images containing circles) is nontrivial.

Page 166
Advanced RDBMS

Animations: Temporal sequences of image or graphic data.

Video: A set of temporally sequenced photographic data for presentation at specified


rates– for example, 30 frames per second.

Structured audio: A sequence of audio components comprising note, tone, duration, and
so forth.
Audio: Sample data generated from aural recordings in a string of bits in digitized form.
Analog recordings are typically converted into digital form before storage.

Composite or mixed multimedia data: A combination of multimedia data types such as


audio and video which may be physically mixed to yield a new storage format or
logically mixed while retaining original types and formats. Composite data also contains
additional control information describing how the information should be rendered.

Applications that manage such data can be categorized based on their data management characteristics.

5.2.2. Distributed Database and Client – Server Architecture

Distributed Database Concepts

The distributed database has all of the security concerns of a single-site database plus
several additional problem areas. We begin our investigation with a review of the security
elements common to all database systems and those issues specific to distributed systems.

A secure database must satisfy the following requirements (subject to the specific
priorities of the intended application):

1. It must have physical integrity (protection from data loss caused by power failures or
natural disaster),
2. It must have logical integrity (protection of the logical structure of the database),
3. It must be available when needed,
4. The system must have an audit system,
5. It must have elemental integrity (accurate data),
6. Access must be controlled to some degree depending on the sensitivity of the data,
7. A system must be in place to authenticate the users of the system, and
8. Sensitive data must be protected from inference [Pflee89].

The following discussion focuses on requirements 5-8 above, since these security areas
are directly affected by the choice of DBMS model. The key goal of these requirements is
to ensure that data stored in the DBMS is protected from unauthorized observation or
inference, unauthorized modification, and from inaccurate updates.


This can be accomplished by using access controls, concurrency controls, updates using
the two-phase commit procedure (this avoids integrity problems resulting from physical
failure of the database during a transaction), and inference reduction strategies. The level
of access restriction depends on the sensitivity of the data and the degree to which the
developer adheres to the principle of least privilege (access limited to only those items
required to carry out assigned tasks).
Typically, a lattice is maintained in the DBMS that stores the access privileges of
individual users. When a user logs on, the interface obtains the specific privileges for the
user.

According to Pfleeger [Pflee89], access permission may be predicated on the satisfaction


of one or more of the following criteria:

(1) Availability of data: Unavailability of data is commonly caused by the locking of a


particular data element by another subject, which forces the requesting subject to wait in
a queue.

(2) Acceptability of access: Only authorized users may view and or modify the data. In a
single level system, this is relatively easy to implement. If the user is unauthorized, the
operating system does not allow system access. On a multilevel system, access control is
considerably more difficult to implement, because the DBMS must enforce the
discretionary access privileges of the user.

(3) Assurance of authenticity: This includes the restriction of access to normal working
hours to help ensure that the registered user is genuine. It also includes a usage analysis
which is used to determine if the current use is consistent with the needs of the registered
user, thereby reducing the probability of a fishing expedition or an inference attack.
Concurrency controls help to ensure the integrity of the data. These controls regulate the
manner in which the data is used when more than one user is using the same data
element. These are particularly important in the effective management of a distributed
system, because, in many cases, no single DBMS controls data access. If effective
concurrency controls are not integrated into the distributed system, several problems can
arise.

Bell and Grisom [BellGris92] identify three possible sources of concurrency problems:

(1) Lost update: A successful update was inadvertently erased by another user.

(2) Unsynchronized transactions that violate integrity constraints.

(3) Unrepeatable read: Data retrieved is inaccurate because it was obtained during an
update. Each of these problems can be reduced or eliminated by implementing a suitable
locking scheme (only one subject has access to a given entity for the duration of the lock)
or a timestamp method (the subject with the earlier timestamp receives priority).
Special problems exist for a DBMS that has multilevel access. In a multilevel access
system, users are restricted from having complete data access. Policies restricting user


access to certain data elements may result from secrecy requirements, or they may result
from adherence to the principal of least privilege (a user only has access to relevant
information). Access policies for multilevel systems are typically referred to as either
open or closed. In an open system, all the data is considered unclassified unless access to
a particular data element is expressly forbidden. A closed system is just the opposite. In
this case, access to all data is prohibited unless the user has specific access privileges.
Classification of data elements is not a simple task. This is due, in part, to conflicting
goals. The first goal is to provide the database user with access to all non-sensitive data.
The second goal is to protect sensitive data from unauthorized observation or inference.
For example, the salaries for all of a given firm's employees may be considered non-
sensitive as long as the employee's names are not associated with the salaries. Legitimate
use can be made of this data. Summary statistics could be developed such as mean
executive salary and mean salary by gender. Yet an inference could be made from this
data. For example, it would be fairly easy to identify the salaries of the top executives.
Another problem is data security classification. There is no clear-cut way to classify data.
Millen and Lunt [MilLun92] demonstrate the complexity of the problem:

They state that when classifying a data element, there are three dimensions:

1. The data may be classified.


2. The existence of the data may be classified.
3. The reason for classifying the data may be classified [MilLun92].

The first dimension is the easiest to handle. Access to a classified data item is simply
denied. The other two dimensions require more thought and more creative strategies. For
example, if an unauthorized user requests a data item whose existence is classified, how
does the system respond? A poorly planned response would allow the user to make
inferences about the data that would potentially compromise it.

Key Issues in Distributed Databases

Three key issues we have to consider in DDS are:


• Data Allocation: where are data placed? Data should be stored at site with "optimal"
distribution.
• Fragmentation: relation may be divided into a number of sub-relations (called
fragments) , which are stored in different sites.
• Replication: copy of fragment may be maintained at several sites.

Definition and allocation of fragments carried out strategically to achieve:


• Locality of Reference
• Improved Reliability and Availability
• Improved Performance
• Balanced Storage Capacities and Costs
• Minimal Communication Costs.
• Involves analysing most important transactions, based on quantitative/qualitative
information.


a. Data Allocation
Four strategies regarding placement of data are:
• Centralized
• Partitioned (or Fragmented)
• Complete Replication
• Selective Replication

• Centralized: Consists of single database stored at one site with users distributed across
the network.
• Partitioned: Database partitioned into disjoint fragments, each fragment assigned to
one site.
• Complete Replication: Consists of maintaining complete copy of database at each site.
• Selective Replication: Combination of partitioning, replication, and centralization.

b. Data Fragmentation

In a DDS, it is important to determine the site used to store the data.


In order to assess the need for a distributed database system, the required partitioning of
the data or fragmentation must first be studied. The distributed database can involve both
horizontal and vertical partitioning. The main types of fragmentation are:
• Horizontal
• Vertical
• Mixed

Horizontal partitioning means that complete records are stored at a location, but each
location holds only a subset of the records. Vertical partitioning means that parts of each
record (subsets of its attributes) are stored at different locations. Mixed fragmentation
combines horizontal and vertical partitioning.
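
As a sketch (a hypothetical EMPLOYEE table), horizontal and vertical fragments can be described with ordinary SQL views; the key attribute is repeated in each vertical fragment so that the original records can be rebuilt by a join:

-- Horizontal fragment: complete records, but only those of department 5,
-- to be allocated to the site serving department 5.
CREATE VIEW EMP_D5 AS
SELECT * FROM EMPLOYEE WHERE Dno = 5;

-- Vertical fragment: only the payroll-related attributes, plus the key ENo.
CREATE VIEW EMP_PAY AS
SELECT ENo, Salary FROM EMPLOYEE;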

c. Available Network

The design of distributed database systems is strongly influenced by the type of
underlying WAN or LAN. Distributed database systems involving vertical partitioning
can run only on those networks that are connected continuously - at least during the hours
when the distributed database is operational.

Networks that are not continuously connected typically do not allow transactions across
sites, but may keep local copies of remote data and refresh the copies periodically. For
example, a nightly backup might be taken. For applications where consistency is not
critical, this is acceptable. This is also acceptable for systems involving horizontal
partitioning of the data.


d. Transaction Management

This is used when vertical partitioning is used and special techniques must be applied in
order to ensure that the transaction is applied in two different databases so as not to cause
inconsistency. This technique is called the two-phase commit.

It is recommended that the DBMS vendor provide the distributed transaction


management software. The supplier should not attempt to write transaction management
code nor buy a third party product for such a purpose.

e. Replication

Replication is the process of synchronizing several copies of the same records or record
fragments located at different sites and is used to increase the availability of data and to
speed query evaluation.

The supplier must lay out a detailed Replication Plan including

• The partitioning of the data, and how to select data field names and key values so as not
to cause conflicts between sites
• The timing of the replication (i.e., synchronous vs. asynchronous)
• Resolution of potentially conflicting updates at different sites and ways of detecting them

Note that suppliers feel that they can handle replication and especially an asynchronous
one (i.e., copying numerous records from one database to the other).
Unless such activities are labeled remote backups, it is recommended that the DBMS
vendor provide the replication software. The supplier should not attempt to write
replication code nor buy a third party product for such a purpose.

Types of Distributed Databases

1. Homogeneous DDBMS:
• All sites use the same DBMS product (e.g., Oracle)
• Fairly easy to design and manage.
2. Heterogeneous DDBMS:
• Sites may run different DBMS products (e.g., Oracle and Ingres)
• Possibly different underlying data models (e.g., a relational database and an OO database)
• Occurs when sites have implemented their own databases and integration is
considered later.

Query Processing in Distributed Databases

Let us understand the Query Processing with respect to Employee and Department
Relations with no fragmentation.
The processing of a Distributed Query can be done based on the following strategies:


1. Transfer the Employee and Department Relations to the result site.


2. Transfer the Employee relation to Site A where Department relation is located, execute
the query and send the result to the output site.
3. Transfer the Department relation to Site B where Employee relation is located, execute
the query and send the result to the output site.
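
For concreteness, the query being distributed could be a simple join such as the one below (attribute names are assumed); strategy 2 ships EMPLOYEE to site A before evaluating the join there, while strategy 3 ships DEPARTMENT to site B, and the cheaper choice depends on the sizes of the two relations and of the result:

SELECT E.Name, D.DName
FROM EMPLOYEE E, DEPARTMENT D
WHERE E.Dno = D.DNo;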

Concurrency Control and Recovery in DDS


Concurrency control must deal with multiple copies of data items: a copy has to be made
consistent with the other copies if the site on which it is stored fails and later recovers.

Recovery in DDS is taken care in terms of the following:

1. Failure of Individual sites – when a site recovers its local data must be brought
upto date.
2. Failure of communication Link – the system must be able to deal with failure of
one or more communication links.
3. Distributed Commit – Problem is usually solved two-phase commit protocol.
4. Distributed Deadlock – Techniques for dealing with deadlocks must be followed.

Assume that you and I both read the same row from the Customer table, we both change
the data, and then we both try to write our new versions back to the database. Whose
changes should be saved? Yours? Mine? Neither? A combination? Similarly, if we
both work with the same Customer object stored in a shared object cache and try to make
changes to it, what should happen?

To understand how to implement concurrency control within your system you must start
by understanding the basics of collisions – you can either avoid them or detect and then
resolve them. The next step is to understand transactions, which are collections of
actions that potentially modify two or more entities. On modern software development
projects, concurrency control and transactions are not simply the domain of databases,
instead they are issues that are potentially pertinent to all of your architectural tiers.

a. Collisions

When referential integrity and shared business logic are implemented, cross-schema
referential integrity problems arise from an object schema being mapped to a data schema.
With respect to collisions, things are a little simpler: we only need to worry about
ensuring the consistency of entities within the system of record. The system of record is
the location where the official version of an entity is located. This is often data stored
within a relational database, although other representations, such as an XML structure or
an object, are also viable.

A collision is said to occur when two activities, which may or may not be full-fledged
transactions, attempt to change entities within a system of record. There are three
fundamental ways in which such conflicts can arise:


1. Dirty read. Activity 1 (A1) reads an entity from the system of record and then
updates the system of record but does not commit the change (for example, the
change hasn’t been finalized). Activity 2 (A2) reads the entity, unknowingly
making a copy of the uncommitted version. A1 rolls back (aborts) the changes,
restoring the entity to the original state that A1 found it in. A2 now has a version
of the entity that was never committed and therefore is not considered to have
actually existed.
2. Non-repeatable read. A1 reads an entity from the system of record, making a
copy of it. A2 deletes the entity from the system of record. A1 now has a copy of
an entity that does not officially exist.
3. Phantom read. A1 retrieves a collection of entities from the system of record,
making copies of them, based on some sort of search criteria such as "all
customers with first name Bill." A2 then creates new entities, which would have
met the search criteria (for example, inserts “Bill Klassen” into the database),
saving them to the system of record. If A1 reapplies the search criteria it gets a
different result set.

b. Locking Strategies

So what can you do? First, you can take a pessimistic locking approach that avoids
collisions but reduces system performance. Second, you can use an optimistic locking
strategy that enables you to detect collisions so you can resolve them. Third, you can
take an overly optimistic locking strategy that ignores the issue completely.

Pessimistic locking: is an approach where an entity is locked in the database for the
entire time that it is in application memory (often in the form of an object). A lock either
limits or prevents other users from working with the entity in the database.
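One common realization of pessimistic locking in SQL databases is a SELECT ... FOR UPDATE issued inside the transaction. The sketch below assumes a PEP 249 (DB-API) style connection object, a customer table with id and name columns, and a DBMS that supports FOR UPDATE row locks; none of these details come from the text above.

# Pessimistic locking sketch: the row stays locked from SELECT ... FOR UPDATE
# until commit() or rollback(). `conn` is a hypothetical DB-API connection and
# the %s parameter style is an assumption about the driver.
def rename_customer(conn, customer_id, new_name):
    cur = conn.cursor()
    try:
        # Acquire the row lock; other writers of this row now block or fail.
        cur.execute("SELECT name FROM customer WHERE id = %s FOR UPDATE",
                    (customer_id,))
        if cur.fetchone() is None:
            raise ValueError("no such customer")
        cur.execute("UPDATE customer SET name = %s WHERE id = %s",
                    (new_name, customer_id))
        conn.commit()       # releases the lock
    except Exception:
        conn.rollback()     # also releases the lock
        raise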

Optimistic locking: With multi-user systems it is quite common to be in a situation


where collisions are infrequent. For example, a case where two people are working with
Customer objects, but with different customers and therefore they won’t collide. When
this is the case optimistic locking becomes a viable concurrency control strategy. The
idea is that you accept the fact that collisions occur infrequently, and instead of trying to
prevent them you simply choose to detect them and then resolve the collision when it
does occur.
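A common way to detect such collisions is a version (or timestamp) column: an UPDATE succeeds only if the row still carries the version that was originally read. The table and column names below are invented for illustration, and SQLite is used purely so the sketch is self-contained and runnable.

# Optimistic locking sketch: detect collisions with a version column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
conn.execute("INSERT INTO customer VALUES (1, 'Bill Klassen', 0)")
conn.commit()

def read_customer(cid):
    return conn.execute("SELECT name, version FROM customer WHERE id = ?", (cid,)).fetchone()

def update_customer(cid, new_name, expected_version):
    # The UPDATE succeeds only if nobody bumped the version since we read it.
    cur = conn.execute(
        "UPDATE customer SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, cid, expected_version))
    conn.commit()
    return cur.rowcount == 1          # False means a collision was detected

_, v_yours = read_customer(1)
_, v_mine = read_customer(1)          # both activities read version 0
print(update_customer(1, "William Klassen", v_yours))   # True: first write wins
print(update_customer(1, "Billy Klassen", v_mine))      # False: collision, must be resolved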

Overly Optimistic Locking: With this strategy you neither try to avoid nor detect
collisions, assuming that they will never occur. This strategy is appropriate for single
user systems, systems where the system of record is guaranteed to be accessed by only
one user or system process at a time, or read-only tables. These situations do occur. It is
important to recognize that this strategy is completely inappropriate for multi-user
systems.

5.2.3. Overview of client server architecture and its relationship to distributed databases

Evolving of Client-Server Architecture

The term client/server was first used in the 1980s in reference to personal computers
(PCs) on a network. The actual client/server model started gaining acceptance in the late
1980s. The client/server software architecture is a versatile, message-based and modular
infrastructure that is intended to improve usability, flexibility, interoperability, and
scalability as compared to centralized, mainframe, time sharing computing.

A client is defined as a requester of services and a server is defined as the provider of


services. A single machine can be both a client and a server depending on the software
configuration.

Mainframe architecture (not a client/server architecture). With mainframe software


architectures all intelligence is within the central host computer. Users interact with the
host through a terminal that captures keystrokes and sends that information to the host.

File sharing architecture (not a client/server architecture). The original PC networks


were based on file sharing architectures, where the server downloads files from the
shared location to the desktop environment. The requested user job is then run (including
logic and data) in the desktop environment. File sharing architectures work if shared
usage is low, update contention is low, and the volume of data to be transferred is low.

Client/server architecture. As a result of the limitations of file sharing architectures, the


client/server architecture emerged. This approach introduced a database server to replace
the file server. Using a relational database management system (DBMS), user queries
could be answered directly. The client/server architecture reduced network traffic by
providing a query response rather than total file transfer. It improves multi-user updating
through a GUI front end to a shared database. In client/server architectures, Remote
Procedure Calls (RPCs) or standard query language (SQL) statements are typically used
to communicate between the client and server.

Relationship between Client-Server Architecture and Distributed Databases

Three tier architectures. The three tier architecture (see Three Tier Software
Architectures) (also referred to as the multi-tier architecture) emerged to overcome the
limitations of the two tier architecture. In the three tier architecture, a middle tier was
added between the user system interface client environment and the database
management server environment. There are a variety of ways of implementing this
middle tier, such as transaction processing monitors, message servers, or application
servers. The middle tier can perform queuing, application execution, and database
staging. For example, if the middle tier provides queuing, the client can deliver its request
to the middle layer and disengage because the middle tier will access the data and return
the answer to the client. In addition the middle layer adds scheduling and prioritization

for work in progress. The three tier client/server architecture has been shown to improve
performance for groups with a large number of users (in the thousands) and improves
flexibility when compared to the two tier approach. Flexibility in partitioning can be as
simple as "dragging and dropping" application code modules onto different computers in
some three tier architectures. A limitation with three tier architectures is that the
development environment is reportedly more difficult to use than the visually-oriented
development of two tier applications.

Three tier architecture with transaction processing monitor technology. The most
basic type of three tier architecture has a middle layer consisting of Transaction
Processing (TP) monitor technology (see Transaction Processing Monitor Technology).
The TP monitor technology is a type of message queuing, transaction scheduling, and
prioritization service where the client connects to the TP monitor (middle tier) instead of
the database server. The transaction is accepted by the monitor, which queues it and then
takes responsibility for managing it to completion, thus freeing up the client. When the
capability is provided by third party middleware vendors it is referred to as "TP Heavy"
because it can service thousands of users.

Three tier with message server. Messaging is another way to implement three tier
architectures. Messages are prioritized and processed asynchronously. Messages consist
of headers that contain priority information, and the address and identification number.
The message server connects to the relational DBMS and other data sources.

Three tier with an application server. The three tier application server architecture
allocates the main body of an application to run on a shared host rather than in the user
system interface client environment. The application server does not drive the GUIs;
rather it shares business logic, computations, and a data retrieval engine.

Three tier with an ORB architecture. Currently industry is working on developing


standards to improve interoperability and determine what the common Object Request
Broker (ORB) will be. Developing client/server systems using technologies that support
distributed objects holds great promise, as these technologies support interoperability
across languages and platforms, as well as enhancing maintainability and adaptability of
the system. There are currently two prominent distributed object technologies:

 Common Object Request Broker Architecture (CORBA)


 COM/DCOM (see Component Object Model (COM), DCOM, and Related
Capabilities).

Industry is working on standards to improve interoperability between CORBA and


COM/DCOM. The Object Management Group (OMG) has developed a mapping
between CORBA and COM/DCOM that is supported by several products.

Security Problems Unique to Distributed Database Management Systems


Centralized or Decentralized Authorization

In developing a distributed database, one of the first questions to answer is where to grant
system access.
Bell and Grisom [BellGris92] outline two strategies:
(1) Users are granted system access at their home site.
(2) Users are granted system access at the remote site.

The first case is easier to handle. It is no more difficult to implement than a centralized
access strategy. Bell and Grisom point out that the success of this strategy depends on
reliable communication between the different sites (the remote site must receive all of the
necessary clearance information). Since many different sites can grant access, the
probability of unauthorized access increases. Once one site has been compromised, the
entire system is compromised. If each site maintains access control for all users, the
impact of the compromise of a single site is reduced (provided that the intrusion is not the
result of a stolen password).

The second strategy, while perhaps more secure, has several disadvantages. Probably the
most glaring is the additional processing overhead required, particularly if the given
operation requires the participation of several sites. Furthermore, the maintenance of
replicated clearance tables is computationally expensive and more prone to error. Finally,
the replication of passwords, even though they're encrypted, increases the risk of theft.
A third possibility offered by Woo and Lam [WooLam92] centralizes the granting of
access privileges at nodes called policy servers. These servers are arranged in a network.
When a policy server receives a request for access, all members of the network determine
whether to authorize the access of the user. Woo and Lam believe that separating the
approval system from the application interface reduces the probability of compromise.

a. Integrity

Preservation of integrity is much more difficult in a heterogeneous distributed database


than in a homogeneous one. The degree of central control dictates the level of difficulty
with integrity constraints (integrity constraints enforce the rules of the individual
organization). The homogeneous distributed database has strong central control and has
identical DBMS schema. If the nodes in the distributed network are heterogeneous (the
DBMS schema and the associated organizations are dissimilar), several problems can
arise that will threaten the integrity of the distributed data.

They list three problem areas:


1. Inconsistencies between local integrity constraints,
2. Difficulties in specifying global integrity constraints,
3. Inconsistencies between local and global constraints [BellGris92].

Bell and Grisom explain that local integrity constraints are bound to differ in a
heterogeneous distributed database. The differences stem from differences in the

individual organizations. These inconsistencies can cause problems, particularly with


complex queries that rely on more than one database. Development of global integrity
constraints can eliminate conflicts between individual databases. Yet these are not always
easy to implement.

Global integrity constraints on the other hand are separated from the individual
organizations. It may not always be practical to change the organizational structure in
order to make the distributed database consistent. Ultimately, this will lead to
inconsistencies between local and global constraints. Conflict resolution depends on the
level of central control. If there is strong global control, the global integrity constraints
will take precedence. If central control is weak, local integrity constraints will.

5.2.4 Distributed Databases in Oracle


The benefits of the site autonomy in an Oracle distributed database include
 Nodes of the system can mirror the logical organization of companies or groups
that need to maintain independence

 Local administrators control corresponding local data . Therefore, each database


administrator's domain of responsibility is smaller and more manageable.

 Independent Failures are less likely to disrupt other nodes of the distributed
database. No single database failure need halt all distributed operations or be a
performance bottleneck

 Administrators can recover from isolated system failures independent of other


nodes in the system.

 A data dictionary exists for each local database- a global catalog is not necessary
to access local data

 Nodes can upgrade software independently.

Future prospects of Client-Server Technology

The database server is the Oracle software managing a database and a client is an
application that requests information from a server. Each computer in a network is a node
that can host one or more databases. Each node in a distributed database system can act
as a client, a server or both depending on the situation.

The host for the HQ database is acting as a database server when a statement is issued
against its local data, but is acting as a client when it issues a statement against remote
data

Since distributed data processing and database management technology is ever growing,
the growth of client-server technology is very promising.

5.2.5 Deductive Databases

What is a deductive database system?


A deductive database can be defined as an advanced database augmented with an
inference system.

Database + Inference → Deductive database
By evaluating rules against facts, new facts can be derived, which in turn can be used to
answer queries. It makes a database system more powerful.

• Some basic concepts from logic

To understand the deductive database system well, some basic concepts from
mathematical logic are needed.
- term
- n-ary predicate
- literal
- (well-formed) formula
- clause and Horn-clause
- facts
- logic program

- term
A term is a constant, a variable or an expression of the form f(t1, t2, ..., tn),
where t1, t2, ..., tn are terms and f is a function symbol.
- Example: a, b, c, f(a, b), g(a, f(a, b)), x, y, g(x, y)

- n-ary predicate
An n-ary predicate symbol is a symbol p appearing in an expression of the form
p(t1, t2, ..., tn), called an atom, where t1, t2, ..., tn are terms. p(t1, t2, ..., tn) can only
evaluate to true or false.
-Example: p(a, b), q(a, f(a, b)), p(x, y)

- literal
A literal is either an atom or its negation.
-Example: p(a, f(a, b)), ¬p(a, f(a, b))

- (well-formed) formula

-A well-formed (logic) formula is defined inductively as follows:


- An atom is a formula.
- If P and Q are formulas, then so are ¬P, (P ∧ Q), (P ∨ Q), (P → Q), and (P ↔ Q).
- If x is a variable and P is a formula containing x, then (∀x P) and (∃x P) are
formulas.

- clause
- A clause is an expression of the following form:
A1 ∧ A2 ∧ ... ∧ An → B1 ∨ ... ∨ Bm

where Ai and Bj are atoms.


- The above expression can be written in the following equivalent form:

B1 ∨ ... ∨ Bm ← A1 ∧ ... ∧ An
(B1 ∨ ... ∨ Bm is the consequent; A1 ∧ ... ∧ An is the antecedent)

or
B1, ..., Bm ← A1, ..., An

A B A  B A B BA
1 1 1 1 1 1
0 1 1 0 1 1
1ANNAMALAI
0 0
ANNAMALAI 1 0 0
UNIVERSITY
UNIVERSITY
0 0 1 0 0 1
- Horn clause
A Horn clause is a clause with the head containing only
one positive atom.
Bm ← A1, ..., An

- fact

- A fact is a special Horn clause of the following form:


B with all variables in B being instantiated. (B  can be simply written as B.)

- logic program
A logic program is a set of Horn clauses.
Facts:
supervise(franklin, john),
supervise(franklin, ramesh),
supervise(franklin, joyce)
supervise(james, franklin),
supervise(jennifer, alicia),
supervise(jennifer, ahmad),
supervise(james, jennifer).

Rules:
superior(X, Y) ← supervise(X, Y),
superior(X, Y) ← supervise(X, Z), superior(Z, Y),
subordinate(X, Y) ← superior(Y, X).

• Basic inference mechanism for logic programs


- interpretation of programs (rules + facts)

There are two main alternatives for interpreting the theoretical meaning of rules:
proof theoretic, and
model theoretic interpretation

Proof Theoretic Interpretation

1. The facts and rules are considered to be true statements, or axioms.


facts - ground axioms
rules - deductive axioms

2. The deductive axioms are used to construct proofs that derive new facts from existing
facts.
Example:
1. superior(X, Y) ← supervise(X, Y). (rule 1)
2. superior(X, Y) ← supervise(X, Z), superior(Z, Y). (rule 2)

3. supervise(jennifer, ahmad). (ground axiom, given)


4. supervise(james, jennifer). (ground axiom, given)

5. superior(jennifer, ahmad). (apply rule 1 on 3)

6. superior(james, ahmad). (apply rule 2 on 4 and 5)


Model Theoretic Interpretation

1. Given a finite or an infinite domain of constant values, assign to each predicate in the
program every possible combination of values as arguments.

2. All the instantiated predicates constitute a Herbrand base.

3. An interpretation is a subset of the Herbrand base.

4. In the Herbrand base, each instantiated predicate evaluates to true or false in terms of
the given facts and rules.

5. An interpretation is called a model for a specific set of rules and the corresponding
facts if those rules are always true under that interpretation.

6. A model is a minimal model for a set of rules and facts if we cannot change any
element in the model from true to false and still get a model for these rules and facts.

Example:

1. superior(X, Y) ← supervise(X, Y). (rule 1)

2. superior(X, Y) ← supervise(X, Z), superior(Z, Y). (rule 2)


known facts:

supervise(franklin, john), supervise(franklin, ramesh),


supervise(franklin, joyce), supervise(james, franklin),
supervise(jennifer, alicia), supervise(jennifer, ahmad),
supervise(james, jennifer).

For all other possible (X, Y) combinations supervise(X, Y) is false.

domain = {james, franklin, john, ramesh, joyce, jennifer, alicia, ahmad}

Interpretation - model - minimal model
known facts:
supervise(franklin, john), supervise(franklin, ramesh),
supervise(franklin, joyce), supervise(james, franklin),
supervise(jennifer, alicia), supervise(jennifer, ahmad),
supervise(james, jennifer).

For all other possible (X, Y) combinations supervise(X, Y) is false.

derived facts:

superior(franklin, john), superior(franklin, ramesh),


superior(franklin, joyce), superior(jennifer, alicia),
superior(jennifer, ahmad), superior(james, franklin),
superior(james, jennifer), superior(james, john),
superior(james, ramesh), superior(james, joyce),
superior(james, alicia), superior(james, ahmad).

For all other possible (X, Y) combinations superior(X, Y) is false.

The above interpretation is also a model for the rules (1) and (2) since each of them
evaluates always to true under the interpretation. For example,

superior(X, Y) ← supervise(X, Y)

superior(franklin, john) ← supervise(franklin, john) is true.
superior(franklin, ramesh) ← supervise(franklin, ramesh) is true.
... ...

superior(X, Y) ← supervise(X, Z), superior(Z, Y)

superior(james, ramesh) ← supervise(james, franklin), superior(franklin, ramesh) is true.
superior(james, alicia) ← supervise(james, jennifer), superior(jennifer, alicia) is true.

The model is also the minimal model for the rule (1) and (2) and the corresponding facts
since eliminating any element from the model will make some facts or instantiated rules
evaluate to false.

For example,

eliminating supervise(franklin, john) from the model will make this fact no more
true under the interpretation;

eliminating superior (james, ramesh) will make the following rule no more true
under the interpretation:

superior(james, ramesh) ← supervise(james, franklin), superior(franklin, ramesh)

- Inference mechanism

In general, there are two approaches to evaluating logic programs:


bottom-up and top-down.

a. Bottom-up mechanism

1. The inference engine starts with the facts and applies the rules to generate new facts.
That is, the inference moves forward from the facts toward the goal.

2. As facts are generated, they are checked against the query predicate goal for a match.

Example
query goal: superior(james, Y)?
rules and facts are given as above.

1.Check whether any of the existing facts directly matches the query.

2.Apply the first rule to the existing facts to generate new facts.

3.Apply the second rule to the existing facts to generate new facts.

4. As each fact is generated, it is checked for a match of the query goal.

5.Repeat step 1 - 4 until no more new facts can be found.

Example:

1. superior(X, Y) ← supervise(X, Y). (rule 1)

2. superior(X, Y) ← supervise(X, Z), superior(Z, Y). (rule 2)

known facts:
supervise(franklin, john), supervise(franklin, ramesh),
supervise(franklin, joyce), supervise(james, franklin),
supervise(jennifer, alicia), supervise(jennifer, ahmad),
supervise(james, jennifer).

For all other possible (X, Y) combinations supervise(X, Y) is false.
domain = {james, franklin, john, ramesh, joyce, jennifer, alicia, ahmad}
superior(james, Y)?

applying the first rule: superior(james, franklin), superior(james, jennifer)


Y = {franklin, jennifer}

applying the second rule: Y = {john, joyce, ramesh, alicia, ahmad}
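A minimal Python rendering of this bottom-up evaluation follows, using the supervise facts given above and iterating the two rules until no new superior facts appear. The set-based representation is just an illustration, not part of any particular deductive system.

# Bottom-up (forward chaining) sketch for:
#   superior(X, Y) <- supervise(X, Y)
#   superior(X, Y) <- supervise(X, Z), superior(Z, Y)
supervise = {
    ("franklin", "john"), ("franklin", "ramesh"), ("franklin", "joyce"),
    ("james", "franklin"), ("jennifer", "alicia"), ("jennifer", "ahmad"),
    ("james", "jennifer"),
}

superior = set()
changed = True
while changed:                      # repeat until no new facts are generated
    changed = False
    new_facts = set(supervise)      # rule 1
    new_facts |= {(x, y) for (x, z) in supervise      # rule 2
                         for (z2, y) in superior if z == z2}
    if not new_facts <= superior:
        superior |= new_facts
        changed = True

# Check the generated facts against the query goal superior(james, Y)?
print(sorted(y for (x, y) in superior if x == "james"))
# -> ['ahmad', 'alicia', 'franklin', 'jennifer', 'john', 'joyce', 'ramesh']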

b. Top-down mechanism

(also called back chaining and top-down resolution)

1. The inference engine starts with the query goal and attempts to find matches to the
variables that lead to valid facts in the database. That is, the inference moves backward
from the intended goal to determine facts that would satisfy the goal.

2. During the course, the rules are used to generate subgoals. The matching of these
subgoals will lead to the match of the intended goal.

5.2.6 Prolog/Datalog Notation

Predicate has
 a name
 a fixed number of arguments

Rule
 Is of the form head :- body
 where :- is read as if and only if
 E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
 E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

Query
 Involves a predicate symbol with some variable arguments to answer the question, e.g.,


SUPERIOR(james,Y)?

(a) Prolog notation:

supervise(franklin, john),
supervise(franklin, ramesh),
supervise(franklin, joyce),
supervise(james, franklin),
supervise(jennifer, alicia),
supervise(jennifer, ahmad),
supervise(james, jennifer).

(b) Supervisory tree: james supervises franklin and jennifer; franklin supervises john,
ramesh, and joyce; jennifer supervises alicia and ahmad.
Interpretation of Rules

There are two main alternatives for interpreting rules:


 Proof-theoretic
 Model-theoretic

Proof-theoretic: Facts and rules are considered ground axioms; ground axioms contain no
variables. Rules are considered deductive axioms, and deductive axioms can be used to
construct new facts. This process is known as theorem proving, or proving a new fact.

Model-theoretic: Given a finite or infinite domain of constant values, we assign to each
predicate every possible combination of values as arguments; such an assignment of truth
values is called an interpretation.

Model: an interpretation for a specific set of rules under which each of those rules is true.

a. Model-theoretic proofs
Whenever a particular substitution makes all the predicates in the body of a rule true under
the interpretation, the predicate at the head of the rule must also be true.

b. Minimal model
Cannot change any fact from true to false and still get a model for these rules.

5.2.7. Basic interface mechanism for logic programs

The Resource Description Framework (RDF) Model&Syntax Specification describes a


metadata infrastructure which can accommodate classification elements from different
vocabularies i.e. schemas. The underlying model consists of a labeled directed acyclic
graph which can be linearized into eXtensible Markup Language (XML) transfer syntax
for interchange between applications.

Query Languages
In general, query languages are formal languages to retrieve data from a database.
Standardized languages already exist to retrieve information from different types of
databases such as Structured Query Language (SQL) for relational databases and Object
Query Language (OQL) and SQL3 for object databases.

Semi-structured query languages such as XML-QL [3] operate on the document-level
structure.

Logic programs consist of facts and rules where valid inference rules are used to
determine all the facts that apply within a given model.

With RDF, the most suitable approach is to focus on the underlying data model. Even
though XML-QL could be used to query RDF descriptions in their XML encoded form, a
single RDF data model could not be correctly determined with a single XML-QL query
due to the fact that RDF allows several XML syntax encodings for the same data model.

The Metalog Approach

RDF provides the basis for structuring the data present in the web in a consistent and
accurate way. However, RDF is only the first step towards the construction of what Tim
Berners-Lee calls the "web of knowledge", a World Wide Web where data is structured,
and users can fully benefit by this structure when accessing information on the web. RDF
only provides the "basic vocabulary" in which data can be expressed and structured.
Then, the whole problem of accessing and managing this structured data arises.

Metalog provides a "logical" view of metadata present on the web. The Metalog approach
is composed by several components.

In the first component, a particular data semantics is established. Metalog provides a way
to express logical relationships like "and", "or" and so on, and to build up complex
inference rules that encode logical reasoning. This "semantic layer" builds on top of RDF
using a so-called RDF schema.

The second component consists of a "logical interpretation" of RDF data (optionally
enriched with the semantic schema) into logic programming. This way, the understood
semantics of RDF is unfolded into its logical components (a logic program, indeed).
This means that all reasoning on RDF data can be performed by acting upon the
corresponding logical view, the logic program, providing a neat and powerful way to
reason about data.

The third component is a language interface to writing structured data and reasoning
rules. In principle, the first component already suffices: data and rules can be written
directly in RDF, using RDF syntax and the metalog schema. RDF syntax aims at being
more an encoding language rather than a user-friendly language, and it is well recognised
in the RDF community and among vendors that the typical applications will provide
more user-friendly interfaces between the "raw RDF" code and the user.

Another important feature of the language, in this respect, is indeed that it can be used
just as an interface to RDF, without the metalog extensions. This way, users will be able
to access and structure metadata using RDF in a smooth and seamless way, using the
metalog language.

The Metalog Schema

The first correspondence in Metalog is between the basic RDF data model and the
predicates in logic. The RDF data model consists of so-called statements. Statements are
triples where there is a subject (the "resource"), a predicate (the "property"), and an
object (the "literal"). Metalog views an RDF statement in the logical setting as just a
binary predicate involving the subject and the literal. For example, the RDF statement is
seen in logic programming as the predicate.
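A tiny Python sketch of this correspondence, mapping invented (subject, property, object) triples to property(subject, object) facts, is given below; the triples and names are illustrative assumptions only.

# Metalog-style view of RDF statements: (subject, property, object) becomes the
# binary predicate property(subject, object). The triples are invented examples.
triples = [
    ("doc1", "creator", "John Smith"),
    ("doc1", "title", "A Sample Document"),
    ("doc2", "creator", "Jane Doe"),
]

facts = {}
for subject, prop, obj in triples:
    # the property name becomes the predicate name; subject and object its arguments
    facts.setdefault(prop, set()).add((subject, obj))

print(sorted(facts["creator"]))   # [('doc1', 'John Smith'), ('doc2', 'Jane Doe')]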

Once the basic correspondence between the basic RDF data model and
predicates in logic is established, the next step comes easy: we can extend RDF so that the mapping to
logic is able to take advantage of all of the logical relationships present in logical
systems: that is to say, beyond the ability of expressing static facts, we want the ability to
encode dynamic reasoning rules, like in logic programming.

In order to do so, we need at least:

 the standard logical connectors (and, or, not)


 variables

The metalog schema extends plain RDF with this "logical layer", enabling to express
arbitrary logical relationships within RDF. In fact, the metalog schema provides more
accessories besides the aforementioned basic ones (like for example, the "implies"
connector): anyway, so as not to weigh down the discussion, we don't go into further details on this

topic. What the reader should keep in mind is just that the Metalog schema provides the
"meta-logic" operators to reason with RDF statements.

Technically, this is quite easy to do: the metalog schema is just a schema as defined by
the RDF schema specification where, for example, "and" and "or" are subinstances of the
RDF Bag connector.

The mapping between "metalog RDF" and logical formulas is then completely natural:
for each RDF statement that does not use a metalog connector, there is a corresponding
logical predicate as defined before. Then, the metalog connectors are translated into the
corresponding logical connectors in the natural way (so, for instance, the metalog and
connector is mapped using logical conjunction, while the metalog or connector is mapped
using logical disjunction).

The Metalog Syntax

Note that the RDF metalog schema and the corresponding translation into logical
formulas are absolutely general. However, in practice, one also needs to be able to
process the resulting logical formulas in an effective way. In other words, while the RDF
metalog schema nicely extends RDF with the full power of first order predicate calculus,
thus increasing by far the expressibility of basic RDF, there is still the other,
computational, side of the coin: how to process and effectively reason with all these
logical inference rules.

It is well known that in general dealing with full first order predicate calculus is totally
unfeasable computationally. So, what we would like to have is a subset of predicate
calculus.

The third level is then the actual syntax interface between the user and this "metalog
RDF" encoding, with the constraint that the expressibility of the language must fit within
the one provided by logic programming.

The metalog syntax has been explicitly designed with the purpose of being totally
natural-language based, trying to avoid any possible technicalities, and therefore making
the language extremely readable and self-descriptive.

The way metalog achieves this is by a careful use of upper/lower case, quotes, and
by allowing a rather liberal positioning of the keywords (an advanced parser then
disambiguates the keywords from each metalog program line).

Datalog programs and their evaluation

1. A Datalog program is a logic program.

2. In a Datalog program, each predicate contains no function symbols.

3. A Datalog program normally contains two kinds of predicates:

fact-based predicates and rule-based predicates.

fact-based predicates are defined by listing all the combinations of values that make the
predicate true.

Rule-based predicates are defined to be the head of one or more Datalog rules. They
correspond to virtual relations whose contents can be inferred by the inference engine.

Example:

-All the programs discussed earlier are Datalog programs.


superior(X, Y) ← supervise(X, Y).
superior(X, Y) ← supervise(X, Z), superior(Z, Y).
supervise(jennifer, ahmad).
supervise(james, jennifer).

The following is a logic program, but not a Datalog program:


p(X, Y) ← q(f(Y), X)

two important concepts:

- safety of programs
- predicate dependency graph

Safety of programs

A Datalog program or a rule is said to be safe if it generates a finite set of facts.


-Condition of unsafety

A rule is unsafe if one of the variables in the rule can range over an infinite domain of
values, and that variable is not limited to ranging over a finite predicate before it is
instantiated.

-Example:

big_salary(Y) ← Y > 60000.

big_salary(Y) ← Y > 60000, employee(X), salary(X, Y).

The evaluation of these rules (no matter whether in bottom- up or in top-down fashion)
will never terminate.

The following is a safe rule:


big_salary(Y) ← employee(X), salary(X, Y), Y > 60000.

A variable X is limited if

(1) it appears in a regular (not built-in) predicate in the body of the rule.
(built-in predicates: <, >, ≤, ≥, =, ≠)

(2) it appears in a predicate of the form X = c or c = X, where c is a constant.

(3) it appears in a predicate of the form X = Y or Y = X in the rule body, where Y is a


limited variable.

(4) Before it is instantiated, some other regular predicates containing it will have been
evaluated.

Condition of safety:

A rule is safe if each variable in it is limited.


A program is safe if each rule in it is safe.
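A simplified Python sketch of this safety test is given below. It checks only condition (1), that every variable occurring in the rule also occurs in a regular (non-built-in) body predicate; conditions (2) to (4) are ignored, and the rule encoding and upper-case-variable convention are assumptions made for the illustration.

# Simplified safety check: a rule is treated as safe here only if every variable
# appearing anywhere in it also appears in a regular (non-built-in) body predicate.
# Rules are represented as (head, body) where each literal is (predicate, args).
BUILT_INS = {"<", ">", "<=", ">=", "=", "!="}

def variables(args):
    # Convention: variables start with an upper-case letter, constants do not.
    return {a for a in args if isinstance(a, str) and a[:1].isupper()}

def is_safe(rule):
    head, body = rule
    all_vars = variables(head[1]).union(*(variables(args) for _, args in body))
    limited = set().union(*(variables(args) for pred, args in body
                            if pred not in BUILT_INS)) if body else set()
    return all_vars <= limited

unsafe = (("big_salary", ("Y",)), [(">", ("Y", 60000))])
safe = (("big_salary", ("Y",)),
        [("employee", ("X",)), ("salary", ("X", "Y")), (">", ("Y", 60000))])
print(is_safe(unsafe), is_safe(safe))   # False True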

Predicate Dependency graphs

For a program P, we construct a dependency graph G representing a 'refers to' relationship


between the predicates in P. This is a directed graph where there is node for each
predicate and an arc from node q to node p if and only if the predicate q occurs in the

body of a rule whose head predicate is p.

Example:
superior(X, Y) ← supervise(X, Y),
superior(X, Y) ← supervise(X, Z), superior(Z, Y),
subordinate(X, Y) ← superior(Y, X),
supervisor(X, Y) ← employee(X), supervise(X, Y),
over_40K_emp(X) ← employee(X), salary(X, Y), Y ≥ 40000,
under_40K_supervisor(X) ← supervisor(X), not(over_40K_emp(X)),
main_productx_emp(X) ← employee(X), workson(X, productx, Y), Y ≥ 20,
president(X) ← employee(X), not(supervise(Y, X)).
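A small Python sketch (using the predicate names as reconstructed above) that builds this dependency graph and tests for recursion by looking for a cycle:

# Build the predicate dependency graph for the example rules above and test
# whether the rule set is recursive (i.e., whether the graph has a cycle).
rules = [
    ("superior", ["supervise"]),
    ("superior", ["supervise", "superior"]),
    ("subordinate", ["superior"]),
    ("supervisor", ["employee", "supervise"]),
    ("over_40K_emp", ["employee", "salary"]),
    ("under_40K_supervisor", ["supervisor", "over_40K_emp"]),
    ("main_productx_emp", ["employee", "workson"]),
    ("president", ["employee", "supervise"]),
]

# Edge q -> p whenever predicate q occurs in the body of a rule whose head is p.
edges = {}
for head, body in rules:
    for q in body:
        edges.setdefault(q, set()).add(head)

def has_cycle(edges):
    # Depth-first search with colouring: GREY = on the current search path.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            c = colour.get(nxt, WHITE)
            if c == GREY or (c == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False
    nodes = set(edges) | {p for ps in edges.values() for p in ps}
    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in nodes)

print(has_cycle(edges))   # True: superior depends on itself, so the rule set is recursive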

Evaluation of nonrecursive rules

-If the dependency graph for a rule set has no cycles, the rule set is nonrecursive.

-Evaluation involving only rule-based predicate

2.Single rule evaluation

To evaluate a rule of the form:

p ← p1, ..., pn
we first compute the relations corresponding to p1, ..., pn and then the relation
corresponding to p.

3. All the rules will be evaluated along the predicate dependency graph. At each step,
each rule will be evaluated in terms of step (2).

-The general bottom-up evaluation strategy for a nonrecursive query


?-p(x1, x2, …, xn)

1. Locate a set of rules S whose head involves the predicate p. If there are no such rules,
then p is a fact-based predicate corresponding to some database relation Rp; in this case,
one of the following expression is returned and the algorithm is terminated.

(a) If all arguments in p are distinct variables, the relational expression


returned is Rp.

(b) If some arguments are constants or if the same variable appears in more than one
argument position, the expression returned is

SELECT<condition>(Rp),

where the <condition> is a conjunctive condition made up of a number of simple


conditions connected by AND, and constructed as follows:

i. if a constant c appears as argument i, include a simple condition ($i
= c) in the conjunction.

ii. if the same variable appears in both argument location j and k, include a condition
($j = $k) in the conjunction.

2. At this point, one or more rules Si, i = 1, 2, ..., n, n > 0 exist with predicate p as
their head. For each such rule Si, generate a relational expression as follows:

a.Apply selection operation on the predicates in the body for each such rule, as discussed
in Step 1(b).

b.A natural join is constructed among the relations that correspond to the predicates in the
body of the rule Si over the common variables. Let the resulting relation from this join be
Rs.

c. If any built-in predicate X θ Y was defined over the arguments X and Y, the result of the
join is subjected to an additional selection: SELECT X θ Y(Rs)

d. Repeat Step 2(c) until no more built-in predicates apply.

3. Take the UNION of the expressions generated in Step 2

Evaluation of recursive rules

-If the dependency graph for a rule set has at least one cycle, the rule set is recursive.

ancestor(X, Y) ← parent(X, Y),
ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y).

- naive strategy
- semi-naive strategy
- stratified databases
- some terminology for recursive queries
- linearly recursive
- left linearly recursive
ancestor(X, Y) ← ancestor(X, Z), parent(Z, Y)
- right linearly recursive
ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y)
- non-linearly recursive
sg(X, Y) ← sg(X, Z), sibling(Z, W), sg(W, Y)

- some terminology for recursive queries


- extensional database (EDB) predicate
An EDB predicate is a predicate whose relation is stored in the database - fact-based
predicate.
- intensional database (IDB) predicate

An IDB predicate is a predicate whose relation is defined by logic rules - rule-based


predicate.
- Datalog equation

A Datalog equation is an equation obtained by replacing "←" and "," with "=" and "⋈"
in a rule, respectively.

a(X, Y) = p(X, Y) ∪ πX,Y(p(X, Z) ⋈ a(Z, Y))

a. naive strategy

Consider the following equation system:


Ri = Ei(R1, ..., Ri, ..., Rn) (i = 1, ..., m)
which is formed by replacing the ← symbol with an equality sign in a Datalog program.

b. Algorithm Jacobi naive strategy

input: A system of algebraic equations and EDB


output: The values of the variable relations: R1, ..., Ri, ..., Rn.

for i = 1 to n do Ri := ∅;
repeat
Con := true;
for i = 1 to n do Si := Ri;
for i = 1 to m do {Ri := Ei(S1, ..., Si, ..., Sn);
if Ri ≠ Si then {Con := false; Si := Ri;}}
until Con = true;
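The sketch below shows this naive strategy in Python for the single ancestor equation a(X, Y) = p(X, Y) ∪ πX,Y(p(X, Z) ⋈ a(Z, Y)); the parent relation p is an invented example.

# Naive (Jacobi-style) fixpoint iteration for
#   a(X, Y) = p(X, Y)  UNION  project_XY( p(X, Z) join a(Z, Y) )
p = {("bert", "alice"), ("alice", "derek"), ("derek", "frank")}

def E(a):
    # Evaluate the right-hand side of the equation with the current value of a.
    return p | {(x, y) for (x, z) in p for (z2, y) in a if z == z2}

a = set()
while True:
    new_a = E(a)            # recompute the whole relation from scratch each round
    if new_a == a:          # fixpoint reached
        break
    a = new_a

print(sorted(a))
# [('alice', 'derek'), ('alice', 'frank'), ('bert', 'alice'),
#  ('bert', 'derek'), ('bert', 'frank'), ('derek', 'frank')]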

naive strategy

sg(X, Y) ← sg(X, W), sibling(W, Z), sg(Z, Y)

sibling(X, Y) ← parent(X, W), sibling(W, Z), parent(Y, Z)

evaluation of recursive queries

semi-naive strategy

1. The semi-naive evaluation method is a bottom-up strategy.


2. It is designed to eliminate redundancy in the evaluation of tuples at different iterations.

Let Ri(k) be the temporary value of relation Ri at iteration step k.


The differential of Ri between step k and step k - 1 is defined as follows:
ANNAMALAI
ANNAMALAI UNIVERSITY
UNIVERSITY
Di(k) = Ri(k) - Ri(k-1)

For a linearly recursive rule set, Di(k) can be substituted for Ri in the k-th iteration of
the naïve algorithm.

3.The result is obtained by the union of the newly obtained term Ri and that obtained in
the previous step.

c. Algorithm seminaive strategy

input: A system of algebraic equations and EDB.


output: The values of the variable relations: R1, ..., Ri, ..., Rn.

for i = 1 to n do Ri := ∅;
for i = 1 to m do Di := ∅;
repeat
Con := true;
for i = 1 to n do {Di := E(D1, ..., Di, ..., Dn) - Ri;
Ri := Di ∪ Ri;
if Di ≠ ∅ then Con := false;
}
until Con is true;

Example:

Step 0: D0 = ∅, A0 = ∅;
Step 1: D1 = P = {(bert, alice), (bert, george), (alice, derek), (alice, pat),
(derek, frank)}
A1 = D1 ∪ A0 = {(bert, alice), (bert, george), (alice, derek), (alice, pat),
(derek, frank)}
Step 2: D2 = {(bert, derek), (bert, pat), (alice, frank)}
A2 = D2 ∪ A1 = {(bert, alice), (bert, george), (alice, derek), (alice, pat),
(derek, frank), (bert, derek), (bert, pat), (alice, frank)}
Step 3: D3 = {(bert, frank)}
A3 = D3 ∪ A2 = {(bert, alice), (bert, george), (alice, derek), (alice, pat),
(derek, frank), (bert, derek), (bert, pat), (alice, frank), (bert, frank)}
Step 4: D4 = ∅.

The advantage of the semi-naive method is that at each step a differential term
Di is used in each equation instead of the whole Ri. In this way, the time
complexity of a computation is decreased drastically.
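A Python sketch of the semi-naive idea for the same ancestor example follows; each round joins p only with the differential D of the previous round, and the totals match the worked example above.

# Semi-naive evaluation of the ancestor example: only the differential D
# (tuples new in the previous round) is joined with p at each step.
p = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
     ("alice", "pat"), ("derek", "frank")}

A = set()        # accumulated ancestor relation
D = set(p)       # step 1 differential: the base facts themselves
step = 1
while D:
    print(f"Step {step}: {len(D)} new tuple(s)")
    A |= D
    # Join p with only the new tuples, then discard anything already known.
    D = {(x, y) for (x, z) in p for (z2, y) in D if z == z2} - A
    step += 1

print(len(A), "tuples in total")   # 9, matching A3 in the worked example above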

-The magic-set rule rewriting technique

1. During a bottom-up evaluation, too many irrelevant tuples are


evaluated.

For example, to evaluate the query sg(john, Z)? using the following rules:

sg(X, Y) ← flat(X, Y),

sg(X, Y) ← up(X, Z), sg(Z, W), down(W, Y),

a bottom-up method will generate all sg-tuples and then apply a selection operation to
obtain the answers.

2. Using the constants appearing in the query to restrict computation.

d. Stratified databases

A stratified database is a Datalog program containing negated predicates.


Example: Suppose that a supplier might wish to backorder items that are not in the
warehouse. It would be convenient to write:

backorder(X) ← item(X), ¬warehouse(X).

Its logically equivalent form is
backorder(X) ∨ warehouse(X) ← item(X).
But this rule has a different meaning: if X is an item, then backorder it or it is stored in
the warehouse. This is not what we want.

- Problem: recursion via negation

p(X) ← ¬q(X),
q(X) ← ¬p(X).

To avoid the recursion via negation, we introduce the concept of stratification, which is
defined by the use of a level l mapping.

level l mapping: assign each literal in the program an integer such that if

B ← A1, …, An

and Ai is positive, then l(Ai) ≤ l(B) for all i, 1 ≤ i ≤ n. If Ai is negative, then l(Ai) < l(B)
for all i, 1 ≤ i ≤ n.

If you can assign integers to all the literals in a program using a level mapping, then this
program is stratifiable.
p(X) ← ¬q(X),
q(X) ← ¬p(X).

In fact, we cannot find a level mapping for any program which contains recursion via
negation.

Evaluate the literals in the program from low level to the high level.

- However, there is no unique level mapping for the following program:

Example:

path(X, Y) ← edge(X, Y),
path(X, Y) ← edge(X, Z), path(Z, Y),
acyclic_path(X, Y) ← path(X, Y), ¬path(Y, X).

We can find many level mappings for this program. The following are
just two of them:

Use the Relational Operations

The relational operations can be defined in the form of Datalog rules that
define the result of applying these operations on database relations (fact predicates).

Evaluation of Non-recursive Datalog Queries

Define an inference mechanism based on relational database query processing concepts

5.2.8 Deductive database systems

Deductive database systems are database management systems whose query language
and (usually) storage structure are designed around a logical model of data. As relations
are naturally thought of as the \value" of a logical predicate, and relational languages
such as SQL are syntactic sugarings of a limited form of logical expression, it is easy to
see deductive database systems as an advanced form of relational systems.

The deductive systems do, however, share with the relational systems the important
property of being declarative, that is, of allowing the user to query or update by saying
what he or she wants, rather than how to perform the operation.

Declarativeness is now being recognized as an important driver of the success of


relational systems. As a result, we see deductive database technology, and the
declarativeness it engenders, infiltrating other branches of database systems, especially the
object-oriented world, where it is becoming increasingly important to interface object-
oriented and logical paradigms in so-called DOOD (Declarative and Object-Oriented
Database) systems.

Another important thrust has been the problem of coping with negation or nonmonotonic
reasoning, where classical logic does not offer, through the conventional means of logical
deduction, an adequate definition of what some very natural logical statements "mean" to
the programmer.

Objective Of Deductive Databases

The objective of deductive databases is to provide efficient support for sophisticated


queries and reasoning on large databases; toward this goal, they combine the technology
of logic programming with that of relational databases. Deductive database research has
produced methods and techniques for implementing the declarative semantics of logical
rules via efficient computation of fixpoints. Also, advances in language design and
nonmonotonic semantics were made to allow the use of negation and set-aggregates in
recursive programs; these yield greater expressive power while retaining polynomial data
complexity and semantic well-formedness. Deductive database systems have been used
in data mining and other advanced applications, and their techniques have been
incorporated into a new generation of commercial databases

Deductive Object oriented Databases

A deductive database system is a database system which can make deductions (i.e.,
conclude additional rules or facts) based on rules and facts stored in the (deductive)
database. Deductive database systems:

 Mainly deal with rules and facts.


 Use a declarative language (such as Prolog) to specify those rules and facts.
 Use an inference engine which can deduce new facts and rules from those given.

A good example of a declarative language would be Prolog, but for databases Datalog is
used more often. Datalog is both a syntactic subset of prolog and a database query
language – it is designed specifically for working with logic and databases. Deductive
databases are also known as logic databases, knowledge systems and inferential
databases. The problem domain of an expert system / deductive database is usually quite
narrow. Deductive databases are similar to expert systems - “traditional” expert systems
have assumed that all the facts and rules they need (their knowledge base) will be loaded
into main memory, whereas a deductive database uses a database (usually on disk
storage) as its knowledge base. Traditional expert systems have usually also taken their
facts and rules from a real expert in their problem domain, whereas deductive databases

find their knowledge inherent in the data. Deductive databases and expert systems are
mainly used for:

 Replicating the functionality of a real expert.


 Hypothesis testing.
 Knowledge discovery (finding new relationships between data).

Applications of Commercial Deductive Database Systems

Notation, Definitions, and Some Basic Concepts

Deductive database systems divide their information into two categories:

1. Data, or facts, that are normally represented by a predicate with constant


arguments (by a ground atom). For example, the fact parent(joe, sue) means that
Sue is a parent of Joe. Here, parent is the name of a predicate, and this predicate is
represented extensionally, that is, by storing in the database a relation of all the
true tuples for this predicate. Thus, (joe; sue) would be one of the tuples in the
stored relation.

Extensional and intensional databases. Here, sg is a predicate ("same-generation"), and


the head of each of the two rules is the atomic formula p(X, Y). X and Y are variables.
The other predicates found in the rules are flat, up, and down. These are presumably
stored extensionally, while the relation for sg is intensional, that is, defined only by the
rules. Intensional predicates play a role similar to views in conventional database
systems, although we expect that in deductive applications there will be large numbers of
intensional predicates and rules defining them, far more than the number of views defined
in typical database applications.

The first rule can be interpreted as saying that individuals X and Y are at the same
generation if they are related by the predicate flat, that is, if there is a tuple (X; Y ) in the
relation for flat.

The second rule says that X and Y are also at the same generation if there are individuals
U and V such that:

1. X and U are related by the up predicate.


2. U and V are at the same generation.
3. V and Y are related by the down predicate.

These rules thus define the notion of being at the same generation recursively. Since
common implementations of SQL do not support general recursions such as this example

without going to a host-language program, we see one of the important extensions of


deductive systems: the ability to support declarative, recursive queries.

The optimization of recursive queries has been an active research area, and has often
focused on some important classes of recursion. We say that a predicate p depends upon a
predicate q (not necessarily distinct from p) if some rule with p in the head has a subgoal
whose predicate either is q or (recursively) depends on q. If p depends upon q and q
depends upon p, p and q are said to be mutually recursive. A program is said to be linear
recursive if each rule contains at most one subgoal whose predicate is mutually recursive
with the head predicate.

Optimization Techniques

Perhaps the hardest problem in the implementation of deductive database systems is


designing the query optimizer.

While for nonrecursive rules, the optimization problem is similar to that of conventional
relational optimization, the presence of recursive rules opens up a variety of new options
and problems. There is an extensive literature on the subject, and we shall attempt here to
give only the most basic ideas and motivation.

Sometimes, a more restrictive definition is used, requiring that no two distinct predicates
can be mutually recursive, or even that there be at most one recursive rule in the program.
We shall not worry about such distinctions.

a. Magic Sets

The problem addressed by the magic-sets rule rewriting technique is that frequently a
query asks not for the entire relation corresponding to an intensional predicate, but for a
small subset.

A top-down, or backward-chaining search would start from the query as a goal and use
the rules from head to body to create more goals, and none of these goals would be
irrelevant to the query, although some may cause us to explore paths that happen to \dead
end," because data that would lead to a solution to the query happens not to be in the
database. Prolog evaluation is the best known example of top-down evaluation. However,
the Prolog algorithm, like all purely top-down approaches, suffers from some problems. It
is prone to recursive loops, it may perform repeated computation of some subgoals, and it
is often hard to tell that all solutions to the query goal have been found.

On the other hand, a bottom-up or forward-chaining search, working from the bodies of
the rules to the heads, would cause us to infer facts that would never even be considered
in the top-down search. Yet bottom-up evaluation is desirable because it avoids the
problems of looping and repeated computation that are inherent in the top-down
approach. Also, bottom-up approaches allow us to use set-at-a-time operations like

relational joins, which may be made efficient for disk-resident data, while the pure top-
down methods use tuple-at-a-time operations. Magic-sets is a technique that allows us to
rewrite the rules for each query form (i.e., which arguments of the predicate are bound to
constants, and which are variable), so that the advantages of top-down and bottom-up
methods are combined. That is, we get the focus inherent in top-down evaluation
combined with the looping-freedom, easy termination testing, and efficient evaluation of
bottom-up evaluation. Magic-sets is a rule-rewriting technique. We shall not give the
method here; many variations of it are known and used in practice, and the earlier
sg(john, Z)? example should suggest the idea.

b. Other Rule-Rewriting Techniques

There are a number of other approaches to optimization that sometimes yield better
performance than magic-sets.

These optimizations include the counting algorithm [BMSU86, SZ86, BR87b], the
factoring optimization [NRSU89, KRS90], techniques for deleting redundant rules and
literals [NS89, Sag88], techniques by which "existential" queries (queries for which a
single answer, any answer, suffices) can be optimized [RBK88], and "envelopes" [SS88,
Sag90]. A number of researchers [IW88, ZYT88, Sar89, RSUV89] have studied how to
transform a program that contains nonlinear rules into an equivalent one that contains
only linear rules.
c. Iterative Fixpoint Evaluation

Most rule-rewriting techniques like magic-sets expect implementation of the rewritten


rules by a bottom-up technique, where starting with the facts in the database, we
repeatedly evaluate the bodies of the rules with whatever facts are known (including facts
for the intensional predicates) and infer what facts we can from the heads of the rules.
This approach is called naive evaluation.

We can improve the efficiency of this algorithm by a simple "trick." If in some round of
the repeated evaluation of the bodies we discover a new fact f, then we must have used,
for at least one of the subgoals in the utilized rule, a fact that was discovered on the
previous round. For if not, then f itself would have been discovered in a previous round.
We may thus reorganize the substitution of facts for the subgoals so that at least one of
the subgoals is replaced by a fact that was discovered in the previous round.

d. Extensions of Horn-Clause Programs

A deductive database query language can be enhanced by permitting negated subgoals in


the bodies of rules.

However, we lose an important property of our rules. When rules have the form
introduced in Section 2, there is a unique minimal model of the rules and data. A model
of a program is a set of facts such that for any rule, replacing body literals by facts in the

model results in a head fact that is also in the model. Thus, in the context of a model, a
rule can be understood as saying, essentially, "if the body is true, the head is true".

A minimal model is a model such that no subset is a model. The existence of a unique
minimal model, or least model, is clearly a fundamental and desirable property. Indeed,
this least model is the one computed by naive or seminaive evaluation, as discussed in
Section 3.3. Intuitively, we expect the programmer had in mind the least model when he
or she wrote the logic program. However, in the presence of negated literals, a program
may not have a least model.

An Historical Overview of Deductive Databases

The origins of deductive databases can be traced back to work in automated theorem
proving and, later, logic programming. In an interesting survey of the early development
of the field [Min87], Minker suggests that Green and Raphael [GR68] were the first to
recognize the connection between theorem proving and deduction in databases. They
developed a series of question-answering systems that used a version of Robinson's
resolution principle [Rob65], demonstrating that deduction could be carried out
systematically in a database context. (Cordell Green received a Grace Murray Hopper
Award from the ACM for this work.)

Other early systems included MRPPS, DEDUCE-2, and DADM. MRPPS was an
interpretive system developed at Maryland by Minker's group from 1970 through 1978
that explored several search procedures, indexing techniques, and semantic query
optimization. One of the first papers on processing recursive queries was [MN82]; it
contained the first description of bounded recursive queries, which are recursive queries
that can be replaced by nonrecursive equivalents. DEDUCE was implemented at IBM in
the mid 1970's [Cha78], and supported left-linear recursive Horn-clause rules using a
compiled approach. DADM [KT81] emphasized the distinction between EDB and IDB
and studied the representation of the IDB in the form of 'connection graphs' (closely
related to Sickel's interconnectivity graphs [Sic76]) to aid in the development of query
plans.

A landmark workshop on logic and deductive databases was organized by Gallaire,


Minker and Nicolas at Toulouse in 1977, and several papers from the proceedings
appeared in book form [GM78]. Reiter's influential paper on the closed world assumption
(as well as a paper on compilation of rules) appeared in this book, as did Clark's paper on
negation-as-failure and a paper by Nicolas and Yazdanian on checking integrity
constraints. The workshop and the book brought together researchers in the area of logic
and databases, and gave an identity to the field. (The workshop was also organized in
subsequent years, with proceedings, and continued to influence the field.)

In 1976, van Emden and Kowalski [vEK76] showed that the least fixpoint of a Horn-
clause logic program coincided with its least Herbrand model. This provided a firm
foundation for the semantics of logic programs, and especially, deductive databases, since
fixpoint computation is the operational semantics associated with deductive databases (at
least, of those implemented using bottom-up evaluation). The early work focused largely
on identifying suitable goals for the field, and on developing a semantic foundation. The
next phase of development saw an increasing emphasis on the development of efficient
query evaluation techniques. Henschen and Naqvi proposed one of the earliest efficient
techniques for evaluating recursive queries.

The area of deductive databases has matured in recent years, and it now seems
appropriate to reflect upon what has been achieved and what the future holds. In this unit,
we provide an overview of the area and briefly describe a number of projects that have
led to implemented systems.

Deductive systems are not the only class of systems with a claim to being an extension of
relational systems.

Prolog and Databases

There are two points to consider:

Prolog's depth-first evaluation strategy leads to infinite loops, even for positive programs
and even in the absence of function symbols or arithmetic. In the presence of large
volumes of data, operational reasoning is not desirable, and a higher premium is placed
upon completeness and termination of the evaluation method.

In a typical database application, the amount of data is sufficiently large that much of it is
on secondary storage. Efficient access to this data is crucial to good performance.

The first problem is adequately addressed by memoing extensions to Prolog evaluation.
For example, the widely used Warren abstract machine (WAM) Prolog architecture
[War89] can be extended efficiently to support memoing.

The second problem turns out to be harder. The key to accessing disk data efficiently is to
utilize the set-oriented nature of typical database operations and to tailor both the
clustering of data on disk and the management of buffers in order to minimize the
number of pages fetched from disk. Prolog's tuple-at-a-time evaluation strategy severely
curtails the implementor's ability to minimize disk accesses by re-ordering operations.
The situation can thus be summarized as follows: Prolog systems evaluate logic programs
efficiently in main memory, but are tuple-at-a-time, and therefore inefficient with respect
to disk accesses. In contrast, database systems implement only a nonrecursive subset of
logic programs (essentially described by relational algebra), but do so efficiently with
respect to disk accesses.

The goal of deductive databases is to deal with a superset of relational algebra that
includes support for recursion in a way that permits efficient handling of disk data.
Evaluation strategies should retain Prolog's goal-directed flavor, but be more set-at-a-time.
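The contrast can be made concrete with a small sketch; the relation, data and function names below are invented for illustration and do not describe any particular system. A depth-first, tuple-at-a-time evaluation of the left-recursive path rule loops forever on cyclic data, whereas a set-at-a-time bottom-up evaluation terminates.

# Left-recursive program:  path(X, Y) :- path(X, Z), edge(Z, Y).
#                          path(X, Y) :- edge(X, Y).
import sys

edge = {(1, 2), (2, 3), (3, 1)}          # illustrative cyclic data

def path_top_down(x, y):
    # Prolog-style depth-first, tuple-at-a-time evaluation with the
    # recursive clause tried first: the first subgoal is again path,
    # so the call recurses before ever consulting edge.
    for z in {b for (_, b) in edge}:
        if path_top_down(x, z) and (z, y) in edge:
            return True
    return (x, y) in edge

def path_bottom_up():
    # Set-at-a-time bottom-up evaluation: terminates because the set
    # of derivable facts over a finite database is finite.
    facts = set(edge)
    while True:
        derived = {(x, w) for (x, z) in facts for (z2, w) in edge if z == z2}
        if derived <= facts:
            return facts
        facts |= derived

print(sorted(path_bottom_up()))          # all pairs on the cycle

sys.setrecursionlimit(50)                # fail fast instead of looping
try:
    path_top_down(1, 3)
except RecursionError:
    print("depth-first, tuple-at-a-time evaluation loops on this data")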


5.3 Revision points

• Triggers are executed when a specified condition occurs during insert/delete/update.
• Triggers are actions that fire automatically based on these conditions. Triggers follow
an Event-Condition-Action (ECA) model (a small illustrative sketch of the ECA model
appears after this list).
• Row-level triggers - executed separately for each affected row.
• Statement-level triggers - executed once for the SQL statement.
• An event can be considered as: immediate, deferred, or detached.
• The R-Tree is an indexing technique for typical spatial queries.
• Semi-structured query languages such as XML-QL [3] operate on the document-level
structure.
• Metalog provides a "logical" view of metadata present on the web.
• The middle tier can perform queuing, application execution, and database staging.
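As a purely illustrative aid (an in-memory toy, not the trigger facility of any particular DBMS; all names are invented), the following Python sketch mimics the ECA model and the difference between row-level and statement-level triggers.

# Toy Event-Condition-Action (ECA) dispatcher.
class Trigger:
    def __init__(self, event, condition, action, level="ROW"):
        self.event, self.condition = event, condition
        self.action, self.level = action, level

class Table:
    def __init__(self):
        self.rows, self.triggers = [], []

    def insert(self, new_rows):
        for row in new_rows:
            self.rows.append(row)
            for t in self.triggers:      # row-level: fires once per affected row
                if t.event == "INSERT" and t.level == "ROW" and t.condition(row):
                    t.action(row)
        for t in self.triggers:          # statement-level: fires once per statement
            if t.event == "INSERT" and t.level == "STATEMENT" and t.condition(new_rows):
                t.action(new_rows)

emp = Table()
emp.triggers.append(Trigger("INSERT",
                            condition=lambda r: r["salary"] > 10000,
                            action=lambda r: print("row-level audit:", r["name"])))
emp.triggers.append(Trigger("INSERT",
                            condition=lambda rows: True,
                            action=lambda rows: print("statement-level:", len(rows), "rows inserted"),
                            level="STATEMENT"))

emp.insert([{"name": "A", "salary": 12000}, {"name": "B", "salary": 8000}])
# The row-level trigger fires only for A; the statement-level trigger fires once.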

5.4 Intext Questions


1. Explain the prospects of client-server technology.
2. Elucidate the role played by distributed database management.
3. Discuss Datalog programs and their evaluation.
4. Write a note on enhanced data models for advanced applications.

5.5 Summary

• A trigger follows the Event-Condition-Action (ECA) model: it has an Event, a
Condition and an Action; if no condition is specified, the condition is taken to be always
true.
• A FOR EACH ROW trigger specifies a row-level trigger.
• An active database allows users to make the following changes to triggers:
i. Activate
ii. Deactivate
iii. Drop
• Time-varying attributes are the concern of temporal data models.
• Key issues in distributed database systems (DDS) are fragmentation, data allocation
and replication.
• A Datalog program is a logical program.

5.6 Terminal Exercise


1. Triggers are of two types - _________ level and _________ level.

2. Text, Images and Graphics are available in ___________ databases.
3. _________ means that a copy of a fragment may be maintained at several sites.
4. What is Data Fragmentation?
5. What are the types of Data Fragmentation?
6. What is Two-phase commit?
7. What are the types of Distributed Databases?
8. A _________ is said to occur when two activities, which may or may not be full-
fledged transactions, attempt to change entities within a system of record.
9. What are the three locking strategies?
10. What are the different types of Architectures?
11. A _________ database can be defined as an advanced database augmented with an
inference system.
12. What are the two main alternatives for interpreting rules?

5.7 Suggested Reading


1. [MilLun92] Millen, Jonathan K., Teresa F. Lunt, “Security for Object-oriented
Database Systems,” In Proceedings IEEE Symposium on Research in Security and
Privacy, pp. 260-272,1992.
2. [Mull94] Mullins, Craig S. “The Great Debate, Force-fitting objects into a relational
database just doesn’t work well. The impedance problem is at the root of the
incompatibilities.” Byte, v19 n4, pp. 85-96, April 1994.
3. [Pflee89] Pfleeger, Charles P., Security in Computing. New Jersey: Prentice Hall,
1989.
4. [RobCor93] Rob, Peter and Carlos Coronel, Database Systems, Belmont:
Wadsworth, 1993.

5.8 Assignments
1. Discuss in detail the client-server architecture and its advantages.
2. Deductive databases: discuss their advantages and disadvantages.

5.9 Reference Books

1. [Sud95] Sudama, Ram, “Get Ready for Distributed Objects,” Datamation, v41 n18,
pp. 67-71, October 1995.
2. [ThurFord95] Thuraisingham, Bhavani and William Ford, “Security Constraint
Processing In A Multilevel Secure Distributed Database Management System,” IEEE
Transactions on Knowledge and Data Engineering, v7 n2, pp. 274-293, April 1995.

5.10 Learning Activities

Individuals or groups of learners may visit the library for further reading and related activities.

5.11 Keywords


1. CORBA
2. COM
3. Client-Server Architecture
4. Spatial Queries
5. Fragmentation
6. Allocation
7. Replication
8. Datalog
