ANNAMALAI UNIVERSITY
DIRECTORATE OF DISTANCE EDUCATION
ADVANCED RDBMS
Copyright Reserved
(For Private Circulation Only)
UNIT - I
Topics:
Concepts for Object-Oriented Databases
Object identity, Object Structure and Type Constructors
ODMG (Object Data Management Group)
Object Definition Language (ODL)
Object Query Language (OQL)
Overview of C++ Language Binding
Object Database Conceptual Design
Overview of the CORBA Standard for Distributed Objects
Object Relational and Extended Relational Database Systems:
The Informix Universal Server
Object Relational features of Oracle 8
An Overview of SQL
Implementation and Related Issues for Extended Type Systems
The Nested Relational Data Model
1.0 Introduction
Information is represented in an object-oriented database in the form of objects, as used in Object-Oriented Programming. When database capabilities are combined with object
programming language capabilities, the result is an object database management system
(ODBMS). An ODBMS makes database objects appear as programming language objects
in one or more object programming languages. An ODBMS supports the programming
language with transparently persistent data, concurrency control, data recovery,
associative queries, and other capabilities.
Object database management systems grew out of research during the early to mid-1980s
into having intrinsic database management support for graph-structured objects. The term
"object-oriented database system" first appeared around 1985.
Starting in 2004, object databases saw a second growth period when open-source object databases emerged that were widely affordable and easy to use, because they are written entirely in OOP languages such as Java or C#; examples include db4objects and Perst (from McObject).
Benchmarks between ODBMSs and relational DBMSs have shown that ODBMS can be
clearly superior for certain kinds of tasks. The main reason for this is that many
operations are performed using navigational rather than declarative interfaces, and
navigational access to data is usually implemented very efficiently by following pointers.
Other things that work against ODBMS seem to be the lack of interoperability with a
great number of tools/features that are taken for granted in the SQL world including but
not limited to industry standard connectivity, reporting tools, OLAP tools and backup and
recovery standards. Additionally, object databases lack a formal mathematical
foundation, unlike the relational model, and this in turn leads to weaknesses in their query
support. However, this objection is offset by the fact that some ODBMSs fully support
SQL in addition to navigational access, e.g. Objectivity/SQL++ and Matisse. Effective
use may require compromises to keep both paradigms in sync.
In fact there is an intrinsic tension between the notion of encapsulation, which hides data
and makes it available only through a published set of interface methods, and the
assumption underlying much database technology, which is that data should be accessible
to queries based on data content rather than predefined access paths. Database-centric
thinking tends to view the world through a declarative and attribute-driven viewpoint,
while OOP tends to view the world through a behavioral viewpoint. This is one of the
many impedance mismatch issues surrounding OOP and databases.
Although some commentators have written off object database technology as a failure,
the essential arguments in its favor remain valid, and attempts to integrate database
functionality more closely into object programming languages continue in both the
research and the industrial communities.
1.1 Objectives
The objective of this lesson is to learn Object-Oriented database concepts with respect to object identity, object structure, object database standards, languages and design, and an overview of CORBA.
1.2 Content
Through a database management system, one can insert, update, delete, and view the records in an existing file. An object-oriented database additionally aims to maintain a direct correspondence between a real-world object and its database representation.
The internal structure of an object in OOPLs includes the specification of instance
variables, which hold the values that define the internal state of the object.
An instance variable is similar to the concept of an attribute, except that instance
variables may be encapsulated within the object and thus are not necessarily
visible to external users.
Some OO models insist that all operations a user can apply to an object must be
predefined. This forces a complete encapsulation of objects.
To encourage encapsulation, an operation is defined in two parts:
– the signature or interface of the operation, which specifies the operation name and arguments (or parameters), and
– the method or body of the operation, which specifies the implementation of the operation.
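As a minimal Java sketch of this two-part definition (the Account and AccountImpl names are invented for illustration), the interface carries only the signatures, while the implementing class supplies the method bodies and keeps the instance variable encapsulated:

// Signature (interface) part: operation names and parameters only
interface Account {
    void deposit(double amount);
    double balance();
}

// Method (implementation) part: hidden behind the interface
class AccountImpl implements Account {
    private double state;   // encapsulated instance variable

    public void deposit(double amount) { state += amount; }
    public double balance() { return state; }
}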
We use i1, i2, i3, . . . to stand for unique system-generated object identifiers. Consider the
following objects:
o1 = (i1, atom, ‘Houston’)
o2 = (i2, atom, ‘Bellaire’)
o3 = (i3, atom, ‘Sugarland’)
o4 = (i4, atom, 5)
o5 = (i5, atom, ‘Research’)
o6 = (i6, atom, ‘1988-05-22’)
This example illustrates the difference between the two definitions for comparing
object states for equality.
o1 = (i1, tuple, <a1:i4, a2:i6>)
o2 = (i2, tuple, <a1:i5, a2:i6>)
o3 = (i3, tuple, <a1:i4, a2:i6>)
o4 = (i4, atom, 10)
o5 = (i5, atom, 10)
o6 = (i6, atom, 20)
In this example, the objects o1 and o2 have equal states, since their states at the atomic level are the same, even though the values are reached through distinct objects o4 and o5.
However, the states of objects o1 and o3 are identical, even though the objects
themselves are not because they have distinct OIDs. Similarly, although the states of o4
and o5 are identical, the actual objects o4 and o5 are equal but not identical, because they
have distinct OIDs.
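The same distinction can be seen directly in Java, where == compares object identity (analogous to comparing OIDs) and equals can be defined to compare state; the Atom class below is purely illustrative:

// Illustrative only: an atomic object whose state is an int value
class Atom {
    final int value;
    Atom(int value) { this.value = value; }

    @Override public boolean equals(Object other) {
        return other instanceof Atom && ((Atom) other).value == this.value;
    }
    @Override public int hashCode() { return Integer.hashCode(value); }
}

public class IdentityVsEquality {
    public static void main(String[] args) {
        Atom o4 = new Atom(10);
        Atom o5 = new Atom(10);
        System.out.println(o4.equals(o5)); // true: identical states
        System.out.println(o4 == o5);      // false: distinct identities, like distinct OIDs
    }
}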
TYPE_NAME: function, function, . . . , function
Example (1):
EMPLOYEE: Name, Address, Birthdate, Age, SSN, Salary, HireDate, Seniority
STUDENT: Name, Address, Birthdate, Age, SSN, Major, GPA
OR:
EMPLOYEE subtype-of PERSON: Salary, HireDate, Seniority
STUDENT subtype-of PERSON: Major, GPA
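A rough Java analogue of these subtype declarations follows; the field names mirror the text, everything else (class names, types) is an illustrative assumption, and derived functions such as Age and Seniority are omitted:

class Person {
    String name;
    String address;
    java.util.Date birthdate;
    String ssn;
}

// EMPLOYEE subtype-of PERSON: Salary, HireDate
class Employee extends Person {
    double salary;
    java.util.Date hireDate;
}

// STUDENT subtype-of PERSON: Major, GPA
class Student extends Person {
    String major;
    double gpa;
}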
Example (2): Consider a type that describes objects in plane geometry, which may be defined as follows:
GEOMETRY_OBJECT: Shape, Area, ReferencePoint
RECTANGLE subtype-of GEOMETRY_OBJECT: Width, Height
TRIANGLE subtype-of GEOMETRY_OBJECT: Side1, Side2, Angle
CIRCLE subtype-of GEOMETRY_OBJECT: Radius
An alternative way of declaring these three subtypes is to specify the value of the Shape attribute as a condition that must be satisfied for objects of each subtype:
RECTANGLE subtype-of GEOMETRY_OBJECT (Shape = 'rectangle'): Width, Height
TRIANGLE subtype-of GEOMETRY_OBJECT (Shape = 'triangle'): Side1, Side2, Angle
CIRCLE subtype-of GEOMETRY_OBJECT (Shape = 'circle'): Radius
Unstructured complex object: It is provided by a DBMS and permits the storage and
retrieval of large objects that are needed by the database application.
Typical examples of such objects are bitmap images and long text strings (documents); they are also known as binary large objects, or BLOBs for short. This has been the standard way by which relational DBMSs have dealt with supporting complex objects, leaving the operations on those objects outside the RDBMS.
Structured complex object: It differs from an unstructured complex object in that the
object’s structure is defined by repeated application of the type constructors provided
by the OODBMS. Hence, the object structure is defined and known to the OODBMS.
The OODBMS also defines methods or operations on it.
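A small, purely illustrative Java sketch of the difference: an unstructured complex object is an opaque byte array that the DBMS merely stores and retrieves, while a structured complex object is built from nested type constructors (tuples and sets) whose structure the OODBMS knows about:

// Unstructured complex object: the DBMS only stores and retrieves the bytes (a BLOB)
class ScannedDocument {
    byte[] blob;                       // e.g. a bitmap image or a long text document
}

// Structured complex object: defined by repeated application of type constructors
class Department {                     // tuple constructor
    String name;
    java.util.List<Worker> staff;      // set constructor over another tuple type
}

class Worker {                         // tuple constructor
    String name;
    java.util.Date birthdate;
}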
The ODMG standard covers the object model, the object definition language (ODL), the object query language (OQL), and bindings to object-oriented programming languages.
The object model specifies the data model upon which ODL and OQL are based; it provides data types and type constructors, much as the SQL report describes a standard data model for relational databases.
The relationship between an object and a literal is that a literal has only a value but no object identifier, whereas an object has four characteristics:
• identifier
• name
• lifetime (persistent or transient)
• structure (how it is constructed)
c. OQL Entry Points and Iterator Variables
An entry point is a named persistent object (for many queries, it is the name of the extent of a class). An iterator variable is used when a collection is referenced in an OQL query.
OQL Collection Operators include Aggregate operators such as: min, max, count, sum,
and avg.
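As a hedged sketch of how an entry point, an iterator variable, and an aggregate operator fit together: the query strings below use the extent name students as the entry point and s as the iterator variable. The surrounding Java code assumes an ODMG 3.0-style binding (the org.odmg interfaces) and a vendor-supplied Implementation object, so the details vary by product; the Student class and students extent are illustrative.

import org.odmg.Implementation;
import org.odmg.OQLQuery;

public class OqlSketch {
    // Assumes a database has already been opened and a transaction begun
    public static void run(Implementation odmg) throws Exception {
        // Entry point: the class extent "students"; iterator variable: s
        OQLQuery names = odmg.newOQLQuery();
        names.create("select s.name from s in students where s.gpa > 3.5");
        Object result = names.execute();      // typically a collection of names

        // Aggregate collection operator applied to a query result
        OQLQuery average = odmg.newOQLQuery();
        average.create("avg(select s.gpa from s in students)");
        Object avgGpa = average.execute();

        System.out.println(result + " / " + avgGpa);
    }
}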
ODMG (Object Data Management Group)
Release 2.0 of the ODMG Standard differs from Release 1.2 in a number of ways. With the wide acceptance of Java, we added a Java persistence standard in addition to the existing Smalltalk and C++ ones. The ODMG object model is much more comprehensive: we added a meta-object interface, defined an object interchange format, and worked to make the programming language bindings consistent with the common model. Changes were made throughout the specification based on several years of experience implementing the standard in object database products.
As with Release 1.2, we expect future work to be backward compatible with Release
2.0. Although we expect a few changes to come, for example to the Java binding, the
Standard should now be reasonably stable.
Object Model. We have used the OMG Object Model as the basis for our model. The
OMG core model was designed to be a common denominator for object request
brokers, object database systems, object programming languages, and other
applications. In keeping with the OMG Architecture, we have designed an ODBMS
profile for the model, adding components (relationships) to the OMG core object
model to support our needs. Release 2.0 introduces a meta model.
The Object Data Management Group (ODMG) was a consortium of object database and
object-relational mapping vendors, members of the academic community, and interested
parties. Its goal was to create a set of specifications that would allow for portable
applications that store objects in database management systems. It published several
versions of its specification. The last release was ODMG 3.0. By 2001, most of the major
object database and object-relational mapping vendors claimed conformance to the
ODMG Java Language Binding. Compliance to the other components of the specification
was mixed. In 2001, the ODMG Java Language Binding was submitted to the Java
Community Process as a basis for the Java Data Objects specification. The ODMG
member companies then decided to concentrate their efforts on the Java Data Objects
specification. As a result, the ODMG disbanded in 2001.
Many object database ideas were also absorbed into SQL:1999 and have been
implemented in varying degrees in object-relational database products.
In 2005 Cook, Rai, and Rosenberger proposed to drop all standardization efforts to
introduce additional object-oriented query APIs but rather use the OO programming
language itself, i.e., Java and .NET, to express queries. As a result, Native Queries
emerged. Similarly, Microsoft announced Language Integrated Query (LINQ) and
DLINQ, an implementation of LINQ, in September 2005, to provide close, language-
integrated database query capabilities with its programming languages C# and VB.NET 9.
In February 2006, the Object Management Group (OMG) announced that they had been
granted the right to develop new specifications based on the ODMG 3.0 specification and
the formation of the Object Database Technology Working Group (ODBT WG). The
ODBT WG plans to create a set of standards that incorporates advances in object
database technology (e.g., replication), data management (e.g., spatial indexing), and data
formats (e.g., XML) and to include new features into these standards that support
domains in real-time systems where object databases are being adopted.
Let's take a look at something that comes closer to bearing a relationship to our everyday
programming. Whether you generate your applications or code them, somehow you need
a way to describe your object model. The goal of this Object Definition Language (ODL)
is to capture enough information to be able to generate the majority of most SMB web
apps directly from a set of statements in the language . . .
Here is a rough cut of ODL along with comments. This is very much a work in progress.
Now that I have a meta-grammar and a concrete syntax for describing languages, I can
start to write the languages I have been playing with. I will then build up to those
languages in the framework so that the framework can consume metadata that can be
transformed automatically from ODL, allowing for the automatic generation of most of
my code. Expect to see BIG changes in this grammar as I combine “top down” and
“bottom up” programming, write some real world applications and see where everything
meets in the middle!
Most importantly, we have objects that are comprised of 1..n attributes and that may or
may not have relationships. This is the high level UML model kind of stuff. Note that
ODL is describing functional metadata, so an object would be “Article” – not
“ArticleService” or “ArticleDAO” which are implementation decisions and would be
generated from the Article metadata automatically.
But before that, we will digress into the built-in functions supported in OQL. The built-in functions in OQL fall into the following categories:
As you can see, most array-operating functions accept a boolean expression; the expression can refer to the current object by its variable. This allows operating on arrays without loops: the built-in functions loop through the array and 'apply' the expression to each element.
There is also a built-in object called heap, which offers various useful methods.
Show referents that are not referred to by any other object, i.e., the referent is reachable only by that soft reference:
Note the use of the referrers built-in function to find the referrers of a given object. Because referrers returns an array, the result supports the length property.
Let us refine the above query. We want to find all objects that are referred to only by soft references, but we don't care how many soft references refer to them; i.e., we allow more than one soft reference to refer to each object.
Note that the filter function filters the referrers array using a boolean expression. In the filter condition we check that the class name of the referrer is not java.lang.ref.SoftReference.
Now, if the filtered array contains at least one element, then we know that f.referent is referred to from some object that is not of type java.lang.ref.SoftReference!
Find all finalizable objects (i.e., objects of some class that has the java.lang.Object.finalize() method overridden).
How does this work? When an instance of a class that overrides the finalize() method is created (a potentially finalizable object), the JVM registers the object by creating an instance of java.lang.ref.Finalizer. The referent field of that Finalizer object refers to the newly created "to be finalized" object. (This is a dependency on an implementation detail!)
Find all finalizable objects and approximate size of the heap retained because of
those.
Certainly this looks complex, but it is actually simple. A JavaScript object literal is used to select multiple values in the select expression (the obj and size properties). reachables finds the objects reachable from a given object. map creates a new array from an input array by applying a given expression to each element; the map call in this query creates an array holding the size of each reachable object. The sum built-in adds all the elements of an array. So we get the total size of the objects reachable from the given object (f.referent in this case).
Why do I say approximate size? The HPROF binary heap dump format does not account for the actual bytes used in a live JVM; instead, sizes just large enough to hold the data are recorded. For example, JVMs align smaller data types such as 'char', using 4 bytes instead of 2, and JVMs tend to add one or two header words to each object. None of this is accounted for in the HPROF dump, which uses the minimal size needed to hold the data: for example, 2 bytes for a char, 1 byte for a boolean, and so on.
1.2.8 Overview of C++ Language Binding
The C++ binding to ODBMSs includes a version of the ODL that uses C++ syntax, a
mechanism to invoke OQL, and procedures for operations on databases and transactions.
The Object Definition Language (ODL) is the declarative portion of C++ ODL/OML.
The C++ binding of ODL is expressed as a library that provides classes and functions
to implement the concepts defined in the ODMG object model. OML is a language
used for retrieving objects from the database and modifying them. The C++ OML
syntax and semantics are those of standard C++ in the context of the standard class
library.
ODL/OML specifies only the logical characteristics of objects and the operations used
to manipulate them. It does not discuss the physical storage of objects. It does not
address the clustering or memory management issues associated with the stored
physical representation of objects or access structures. In an ideal world, these would
be transparent to the programmer. In the real world, they are not. An additional set of
constructs called "physical pragmas" is defined to give the programmer some direct
control over these issues, or at least to enable a programmer to provide "hints" to the
storage management subsystem provided as part of the ODBMS run time. Physical
pragmas exist within the ODL and OML. They are added to object type definitions
specified in ODL, expressed as OML operations, or shown as optional arguments to
operations defined within OML.
These pragmas are not in any sense stand-alone languages, but rather a set of
constructs added to ODL/OML to address implementation issues.
The ODMG Smalltalk binding is based upon two principles -- it should bind to
Smalltalk in a natural way that is consistent with the principles of the language, and it
should support language interoperability consistent with ODL specification and
semantics. We believe that organizations specifying their objects in ODL will insist
that the Smalltalk binding honor those specifications. These principles have several
implications that are evident in the design of the binding:
There is a unified type system that is shared by Smalltalk and the ODBMS.
This type system is ODL as mapped into Smalltalk by the Smalltalk binding.
The binding respects the Smalltalk syntax, meaning the Smalltalk language
will not have to be modified to accommodate this binding.
ODL concepts will be represented using normal Smalltalk coding conventions.
The binding respects the fact that Smalltalk is dynamically typed.
The binding respects the dynamic memory-management semantics of
Smalltalk. Objects will become persistent when they are referenced by other
persistent objects in the database, and will be removed when they are no longer
reachable in this manner.
As with other language bindings, ODMG Java binding is based on one fundamental
principle -- the programmer should perceive the binding as a single language for
expressing both database and programming operations, not two separate languages
with arbitrary boundaries between them. This principle has several corollaries:
There is a single, unified type system shared by the Java language and the object database; individual instances of these common types can be persistent or transient.
The binding respects the Java language syntax, meaning that the Java language will not have to be modified to accommodate this binding.
The Java binding provides persistence by reachability, like the ODMG Smalltalk
binding (this has also been called "transitive persistence"). On database commit, all
objects reachable from database root objects are stored in the database.
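A hedged sketch of persistence by reachability using ODMG-style Java interfaces (org.odmg); the Dept and Emp classes are assumed to be persistence-capable and are invented for illustration, and real products differ in how such classes are declared:

import org.odmg.Database;
import org.odmg.Implementation;
import org.odmg.Transaction;

public class ReachabilityDemo {
    // Assumed persistence-capable classes, for illustration only
    static class Dept { String name; Emp head; }
    static class Emp  { String name; }

    public static void store(Implementation odmg, String dbName) throws Exception {
        Database db = odmg.newDatabase();
        db.open(dbName, Database.OPEN_READ_WRITE);

        Transaction tx = odmg.newTransaction();
        tx.begin();

        Dept dept = new Dept();
        dept.head = new Emp();           // reachable from dept

        db.bind(dept, "engineering");    // dept becomes a named database root
        tx.commit();                     // dept and the reachable Emp are both stored

        db.close();
    }
}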
The Java binding provides two ways to declare persistence-capable Java classes:
We want a binding that allows all of these possible implementations. Because Java
does not have all the hooks we might desire, and the Java binding must use standard
Java syntax, it is necessary to distinguish special classes understood by the database
system. These classes are called persistence-capable classes. They can have both
persistent and transient instances. Only instances of these classes can be made
persistent. The current version of the standard does not define how a Java class
becomes a persistence-capable class.
The Common Object Request Broker Architecture (or CORBA) is an industry standard
developed by the Object Management Group (OMG) to aid in distributed objects
programming. It is important to note that CORBA is simply a specification. A CORBA
implementation is known as an ORB (or Object Request Broker). There are several
CORBA implementations available on the market such as VisiBroker, ORBIX, and
others. JavaIDL is another implementation that comes as a core package with the JDK1.3
or above.
Similar to RMI, CORBA objects are specified with interfaces. Interfaces in CORBA,
however, are specified in IDL. While IDL is similar to C++, it is important to note that IDL is not a programming language.
There are a number of steps involved in developing CORBA applications. These are:
Define an interface in IDL
Map the IDL interface to Java (done automatically)
Implement the interface
Develop the server
Develop a client
Run the naming service, the server, and the client.
We now explain each step by walking you through the development of a CORBA-based
file transfer application, which is similar to the RMI application we developed earlier in
this article. Here we will be using the JavaIDL, which is a core package of JDK1.3+.
When defining a CORBA interface, think about the type of operations that the server will
support. In the file transfer application, the client will invoke a method to download a
file. The code sample below shows the interface for FileInterface. Data is a new type introduced
using the typedef keyword. A sequence in IDL is similar to an array except that a
sequence does not have a fixed size. An octet is an 8-bit quantity that is equivalent to the
Java type byte.
Note that the downloadFile method takes one parameter of type string that is declared with the in parameter-passing mode.
IDL defines three parameter-passing modes: in (for input from client to server), out (for
output from server to client), and inout (used for both input and output).
Code Sample
FileInterface.idl interface FileInterface
{
typedef sequence<octet> Data;
Data downloadFile(in string fileName);
};
Once you finish defining the IDL interface, you are ready to compile it. The JDK1.3+
comes with the idlj compiler, which is used to map IDL definitions into Java declarations
and statements.
The idlj compiler accepts options that allow you to specify whether you wish to generate client
stubs, server skeletons, or both. The -f<side> option is used to specify what to generate.
The side can be client, server, or all for client stubs and server skeletons. In this example,
since the application will be running on two separate machines, the -fserver option is
used on the server side, and the -fclient option is used on the client side.
Now, let's compile the FileInterface.idl and generate server-side skeletons. Using the
command:
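The command itself is not shown here; based on the -fserver option described above and the interface file FileInterface.idl, the invocation would be:

idlj -fserver FileInterface.idl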
org.omg.CORBA.Object objRef = orb.resolve_initial_references("NameService");
NamingContext ncRef = NamingContextHelper.narrow(objRef);
// Bind the object reference in naming
NameComponent nc = new NameComponent("FileTransfer", " ");
NameComponent path[] = {nc};
ncRef.rebind(path, fileRef);
System.out.println("Server started....");
// Wait for invocations from clients
java.lang.Object sync = new java.lang.Object();
synchronized(sync){
sync.wait();
}
} catch(Exception e) {
System.err.println("ERROR: " + e.getMessage());
e.printStackTrace(System.out);
}
}
}
Once the FileServer has an ORB, it can register the CORBA service. It uses the COS
Naming Service specified by OMG and implemented by Java IDL to do the registration.
It starts by getting a reference to the root of the naming service. This returns a generic
CORBA object. To use it as a NamingContext object, it must be narrowed down (in other
words, cast) to its proper type, and this is done using the statement:
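That statement, taken from the server code above, is:

NamingContext ncRef = NamingContextHelper.narrow(objRef);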
d. Develop a client
The next step is to develop a client. A partial implementation is shown in the code sample below. Once a reference to the naming service has been obtained, it can be used to access the naming service and find other services (for example, the FileTransfer service). When the FileTransfer service is found, the downloadFile method is invoked.
try {
// create and initialize the ORB
ORB orb = ORB.init(argv, null);
// get the root naming context
org.omg.CORBA.Object objRef =
orb.resolve_initial_references("NameService");
NamingContext ncRef = NamingContextHelper.narrow(objRef);
NameComponent nc = new NameComponent("FileTransfer", " ");
// Resolve the object reference in naming
NameComponent path[] = {nc};
FileInterfaceOperations fileRef =
FileInterfaceHelper.narrow(ncRef.resolve(path));
if(argv.length < 1) {
System.out.println("Usage: java FileClient filename");
}
Running the CORBA naming service. This can be done using the command tnameserv.
By default, it runs on port 900. If you cannot run the naming service on this port, then
you can start it on another port. To start it on port 2500, for example, use the following
command:
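The command is missing here; the standard JDK invocation to start the naming service on port 2500 would be:

tnameserv -ORBInitialPort 2500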
Generate Stubs for the client. Before we can run the client, we need to generate stubs for the client. To do that, get a
copy of the FileInterface.idl file and compile it using the idlj compiler specifying that you wish to generate client-
side stubs, as follows:
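Again the command is not shown in this copy; based on the -fclient option described earlier, it would be:

idlj -fclient FileInterface.idl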
Making a selection between these two distribution mechanisms really depends on the
project at hand and its requirements. I hope this article has provided you with enough
information to get started developing distributed object-based applications and enough
guidance to help you select a distribution mechanism.
With support for CORBA and IIOP, the ValidSolution allows you to create client/server
Web applications that take advantage of the web objects and application services. In
addition, you can now access back-end relational databases for enhanced data integration
using the Enterprise Connection Services.
Valid Components can leverage the Enterprise Connection Services (ECS) for building
live links between pages and forms, to data from relational databases. To set up the links,
you simply use the ECS template application to identify your forms and fields that will
contain external source data, and to define the real-time connection settings. You can set
up connections for DB2, Oracle, Sybase, EDA/SQL, and ODBC.
The Domino Application Server also allows you to design applications with CORBA-
standard distributed objects.
1.2.9 Object Relational and Extended Relational Database Systems: Evolution & Current Trends of Database Technology
Do the security features provided by the database model provide adequate security for the intended
application? Does the implementation of the security controls add an unacceptable
amount of computational overhead? In this paper, the security strengths and weaknesses
of both database models and the special problems found in the distributed environment
are discussed.
As distributed networks become more popular, the need for improvement in distributed
database management systems becomes even more important. A distributed system varies
from a centralized system in one key respect:
The data and often the control of the data are spread out over two or more geographically
separate sites. Distributed database management systems are subject to many security
threats additional to those present in a centralized database management system (DBMS).
Furthermore, the development of adequate distributed database security has been
complicated by the relatively recent introduction of the object-oriented database model.
This new model cannot be ignored. It has been created to address the growing complexity
of the data stored in present database systems.
For the past several years the most prevalent database model has been relational. While
the relational model has been particularly useful, its utility is reduced if the data does not
fit into a relational table. Many organizations have data requirements that are more
complex than can be handled with these data types. Multimedia data, graphics, and
photographs are examples of these complex data types.
Relational databases typically treat complex data types as BLOBs (binary large objects).
For many users, this is inadequate since BLOBs cannot be queried. In addition, database
developers have had to contend with the impedance mismatch between the third
generation language (3GL) and structured query language (SQL). The impedance
mismatch occurs when the 3GL command set conflicts with SQL. There are two types of
impedance mismatches: (1) Data type inconsistency: A data type recognized by the
relational database is not recognized by the 3GL. For example, most 3GLs don’t have a
data type for dates. In order to process date fields, the 3GL must convert the date into a
string or a Julian date. This conversion adds extra processing overhead. (2) Data
manipulation inconsistency: Most procedural languages read only one record at a time,
while SQL reads records a set at a time. This problem is typically overcome by
embedding SQL commands in the 3GL code. Solutions to both impedance problems add
complexity and overhead. Object-oriented databases have been developed in response to
the problems listed above: They can fully integrate complex data types, and their use
eliminates the impedance mismatch [Mull94].
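As a small illustration of the data manipulation inconsistency just described, the hedged Java/JDBC sketch below embeds a set-oriented SQL statement in record-at-a-time 3GL code; the table and column names are invented for the example:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpedanceMismatch {
    public static void printHires(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             // SQL returns a whole set of rows at once ...
             ResultSet rs = stmt.executeQuery(
                     "SELECT emp_name, hire_date FROM employee")) {
            // ... but the host language consumes them one record at a time
            while (rs.next()) {
                String name = rs.getString("emp_name");
                // dates must be converted into a host-language type
                java.util.Date hired = rs.getDate("hire_date");
                System.out.println(name + " hired on " + hired);
            }
        }
    }
}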
In this paper, we will review the security concerns of databases in general and distributed
databases in particular. We will examine the security problems found in both models, and
we will examine the security problems unique to each system. Finally, we will compare
the relative merits of each model with respect to security.
While Oracle and Sybase come to mind first when thinking of relational database
technology for the Unix platform, Informix Corp. claims the largest installed base of
relational database engines running on Unix. (See "Informix on the Move," DBMS,
November 1995, page 46.) Furthermore, Informix appears to be focused more
specifically on a mission statement to deliver "... the best technology and services for
developing enterprisewide data management applications for open systems." Something
must be working right. Informix's 1995 revenue ($709 million) and net income ($105.3
million) are up by more than 50 percent and 59 percent, respectively, compared to 1994.
This puts Informix on track to join the ranks of other billion dollar software businesses
within the next year or two.
Founded in 1980 by Roger Sippl, Informix went public in 1986 and released its current
top-of-the-line product, the OnLine Dynamic Server RDBMS, in 1988. While the current
Informix product line reflects a focus on database servers and tools, Informix has always
encouraged a healthy applications market founded on the use of its tools and server
engines. Whereas Oracle developed its own line of accounting and distribution
applications, Informix left this to third parties. Both FourGen Software (Seattle, Wash.)
and Concepts Dynamic (Schaumburg, Ill.), among others, have developed full accounting
application suites based on the Informix RDBMS and built with Informix development
tools.
The only time Informix diverted from its database-centric strategy was in 1988, when it
merged with Innovative Software, adding the SmartWare desktop applications suite to its
database-centric product line. This product acquisition, together with that of the Wingz
graphical spreadsheet, followed a pattern similar to Novell's later acquisition of
WordPerfect's desktop business. Both companies, Informix and Novell, moved into
businesses that they did not understand and eventually divested the products they
acquired. Also, just as the WordPerfect acquisition triggered the departure of Novell
founder Ray Noorda, the SmartWare acquisition triggered the departure of Roger Sippl
from Informix.
Both Informix and Novell subsequently refocused on their core businesses as a result of
these forays into desktop applications. The current chairman, president, and CEO of
Informix, Phillip E. White, joined the company in 1989. He took over in 1992 from
Roger Sippl, who left to found Visigenic, a database access company focused on ODBC
technology. White is credited with increasing shareholder value from 56 cents per share
at the end of 1990 to $30 per share at the end of 1995. This performance placed Informix
at the top of the Wall Street Journal's Shareholder Scoreboard for best five-year
performer.
Without the opportunity to grow revenues through diversifying into applications or other
non-database areas, Informix could face difficulties in sustaining its growth.
Consequently, Informix is pursuing a number of strategies to strengthen and differentiate
its core database products in order to reach new markets. These strategies include:
* increasing the range of data types that Informix RDBMS engines can handle
DSA (Dynamic Scalable Architecture) is the marketing term for a database architecture designed to position Informix as a
leading provider in the area of parallel processing and scalable database server
technology. DSA provides a foundation for a range of high-end Informix database servers
based on variants of the same core engine technology:
* The OnLine Extended Parallel Server is designed for very high-volume OLTP
environments that need to utilize loosely coupled or shared-nothing computing
architectures composed of clusters of symmetrical multiprocessing (SMP) or massively
parallel processing (MPP) systems.
* The Online Dynamic Server is designed for high-volume OLTP environments that
require replication, mainframe-level database administration tools, and the performance
delivered by Informix's parallel data query technology (PDQ). PDQ enables parallel table
scans, sorts, and joins, parallel query aggregation for decision support and parallel data
loads, index builds, backups, and restores. Although this server supports SMP it does not
support MPP, which is the essential differentiating feature between the OnLine Dynamic
Server and the OnLine Extended Parallel Server.
* The OnLine Workgroup Server is designed for smaller numbers of user connections (up
to 32 concurrent) and lower transaction volumes. It is also easier to administer because it
offers less complex functionality compared to the higher-end servers.
These three server products position Informix to compete effectively against similar
stratified server families from Oracle, IBM, and Sybase, as well as niche players such as
Microsoft with its SQL Server product and Computer Associates with CA-OpenIngres.
However, while IBM may lead with the exceptional database administration breadth and
depth of its DB2 engine or Microsoft with the ease of use of its graphical administration
tools, Informix is setting the pace in support for parallel processing that addresses an
issue dear to every database user's heart, namely performance.
Informix-Universal Server
Informix has supported binary large object (BLOB) data for many years but the company
recognizes that the need to store, and more important, to manipulate complex data other
than text and numeric data, will be critical to its ability to address future customer needs.
For this reason, Informix recently completed its acquisition of Illustra Information
Technologies, founded by Ingres RDBMS designer Dr. Michael Stonebraker. Illustra
specializes in handling image, 2D and 3D spatial data, time series, video, audio, and
document data using snap-in modules called DataBlades that add object handling
capabilities to an RDBMS via extensions to SQL. Informix has announced its intention to
fully integrate Illustra technology into a new Informix-Universal Server product within
the next year.
If Informix manages this task, and analysts such as Richard Finkelstein of Performance
Computing doubt that it will (see Computerworld, February 12, 1996), Informix-
Universal Server could put Informix in a unique position to service specialized and
highly profitable markets such as:
Establishing an early leadership position in any one of these markets could easily account
for another billion dollars in revenue for Informix. This would surely justify the time and
cost required to rearchitect its core engine around the Illustra technology and position
Informix as a player in the object/relational database market.
2. delivery of a DataBlades Developer Tool Kit for creating new user-defined data types
that work in both the Illustra Server and the new Informix-Universal Server (the second quarter of 1996)
3. delivery of the fully merged Informix-Universal Server technology including "snap in"
DataBlades (the fourth quarter of 1996)
a. Riding Waves
To some extent, you could argue that Informix (like competitors Oracle and Sybase) has
surfed the technology wave of relational databases and Unix-based open systems that has
swept across corporations over the last decade. Another more recent wave, data
warehousing, is far from peaking, and Informix hedged its bets in this area with its
Oracle and Sybase have also taken initiatives in this area and are integrating OLAP
technology into their product lines to ensure that they lose as few sales as possible to
multidimensional server vendors such as Arbor Software (Sunnyvale, Calif.), which sells
the Essbase Analysis Server, or to specialized data warehouse server vendors such as Red
Brick Systems (Los Gatos, Calif.). The data warehousing wave provides database
vendors the chance to offer an application that is no more than their current database
engine and some combination of front-end query and reporting tools. The data warehouse
solution from Informix also benefits from its built-in parallel processing functionality and
log-based "continuous" data replication services for populating the data warehouse from
other Informix servers. Leading U.K. database analysts Bloor Research Group cited
Informix's DSA as "the best all-round parallel DBMS on the market" and claimed it "has
significant benefits over almost all its competitors on data warehouse applications"
("Parallel Database Technology: An Evaluation and Comparison of Scalable Systems,"
Bloor Research Group, October 1995).
b. Going Mobile
International Data Corp. forecasts suggest that shipments of laptop computers will grow
from four million in 1995 to some eight million in 1999 in the U.S. alone. In other words,
the road warrior population is set to at least double, and as more workers telecommute
and the influence of the Internet makes itself felt in the business world, the term "office"
will simply come to mean "where you are at this point in time." To support this scenario,
Informix is working on its "anytime, anywhere" strategy, which sounds suspiciously
similar to the concepts espoused by Sybase for its SQL Anywhere server product based
on the recently acquired Watcom SQL engine.
However, the key to Informix's strategy for the mobile computing market is
asynchronous messaging based on new middleware products being built by Informix that
provide store-and-forward message delivery and the use of software agents to manage the
process. Asynchronous messaging lets mobile clients send and receive messages without
maintaining a constant connection with the server. Store-and-forward message delivery
ensures that messages get sent or completed as soon as a connection is established or
reestablished. The middleware and software agents are used to establish and maintain
connections, to automate repetitive tasks, and to intelligently sort and save information.
The applications that deliver this functionality can be created using the Informix class
libraries built in the Informix NewEra tool, which allows for application partitioning to
deploy components on mobile clients or servers.
NewEra is Informix's rapid application development tool that competes with Powersoft's
(a Sybase company) PowerBuilder and Oracle's Developer 2000. Compared to its
competitors, NewEra benefits from a strong object-oriented design that delivers a
repository-based, class library-driven application development paradigm using class
browsers for navigating application objects. NewEra can also generate cross-platform
applications. Specifically, NewEra includes:
* reusable class libraries that can be Informix or third party provided or developer
defined
The impending release of the latest version of NewEra, expected in the second quarter of
1996, is slated to deliver user-defined application partitioning for three-tier client/server
deployment; OLE remote automation server support to allow OLE clients to make
requests against NewEra built application servers; and class libraries to support
transaction-processing monitors for load balancing of high volume OLTP applications. If
this functionality is delivered as promised, then client/server application vendors such as
Concepts Dynamic (Schaumburg, Ill.), whose Control suite of accounting applications is
written in NewEra, will benefit from their use of Informix technology.
Informix, like everyone these days, is hot on the Web. World Wide Web Interface
Kits are available for use by Informix customers building Web applications using
Informix-4GL or Informix-ESQL/C tools that need to use the common gateway interface
(CGI) as a means to access Informix databases across the Internet. Informix has
established a Web partner program to build links with other Web software developers
such as Bluestone Inc.(Mountain View, Calif.) and Spider Technologies (Palo Alto,
Calif.). Informix customers such as MCI, Choice Hotels, and the Internet Shopping
Network are already forging ahead with Informix-based Web solutions. Illustra (now
owned by Informix) also recently collaborated with other partners to deliver "24 Hours in
Cyberspace." This event, claimed to be the largest online publishing event ever staged,
allowed the organizers to create a new web page every 30 minutes comprising
multimedia content delivered from hundreds of sites worldwide and stored in an Illustra
DBMS.
Informix also partnered with Internet darling Netscape Communications Corp. to include
the Informix-OnLine Workgroup Server RDBMS as the development and deployment
database for Netscape's LiveWire Pro. The LiveWire Pro product is part of Netscape's
SuiteSpot Web application development system for building online applications. This
deal involves cross-licensing and selling of Informix and Netscape products and is likely
to be among the first of many such collaborations between database and Internet vendors
during 1996.
While the IPC vs. PC debate rages on in the press, let me put a spin on this scenario for
you. You are a road warrior and before leaving on a trip you slip your personal profile
SmartCard (PPS) into your jacket pocket and leave the laptop at home. Your PPS
contains your personal login information and access numbers for Internet and Intranet
connectivity. Eventually this PPS may also be software agent-trained to search for news
on specific subjects, and may contain a couple of Java applets for corporate Intranet
application front ends to submit your T&E (travel and entertainment) and review your
departmental schedule. When you check into your room, there is an IPC designed
specifically for OLIP (online Internet processing).
This IPC, which costs your hotel the same amount as the TV in your room, is a combined
monitor, PPS reader, and keyboard/mouse already plumbed into the Internet. You switch
on the IPC and with one swipe of your PPS in the reader you upload all your profile data
into the IPC's local memory. While this is taking place, the hotel uses the opportunity to
display its home page, welcoming you to the hotel, advertising goods and services, and, if
you are a regular guest, showing you your current bill and your frequent guest program
status. You then fire up your favorite browser to process some email, set your software
agent off to collect the news, submit your trip expenses to the home office Intranet, and
review your current schedule to book a few calls and juggle some appointments. All of
this was done without a laptop or personal computer in sight and depends only on a
simple device connected to the Internet and a SmartCard.
SmartCards are another technology on which Informix is working together with its
partners, Hewlett-Packard (Palo Alto, Calif.) and GemPlus Card International Corp.
(Gaithersburg, Md.). SmartCards will be used for all sorts of applications including
buying, identifying, and securing things. It is not hard to see SmartCards being carried by
everyone and combining your credit card, phone card, driver's license, and medical alert
data onto one slim "plastic" database.
It's hard to see Informix taking a wrong step at the moment. The positioning of the
Informix-Universal Server, the complementary strategies of mobile computing, Web-
enabling, and SmartCards show some good, focused vision. Phillip White's record, as
well as that of on-staff database gurus such as Dr. Michael Stonebraker of Ingres/Illustra
fame and Mike Saranga of DB2 fame, all show the proven ability to execute these
strategies successfully. Sounds like a recipe for success to me.
The Oracle 8i server software has many optional components to choose from:
The Oracle 8i server software
Net8 Listener
The Oracle8i utilities
SQL*Plus
A starter database
The Spatial option helps with spatial data mapping and handling.
An instance can be started, and a database opened, in restricted mode so that the database is available only to administrative personnel. This mode helps to accomplish the following tasks:
Perform structure maintenance, such as rebuilding indexes.
Perform an export or import of database data
Perform a data load with SQL*Loader
Temporarily prevent typical users from using data.
Like C++, Oracle 8 provides built-in constructors for values of a declared type, and these constructors bear the name of the type. Thus, the type name (for example, point_type) followed by a parenthesized list of appropriate values forms a value of that type.
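A hedged sketch of such a constructor, driven from Java through JDBC so that it stays in the same language as the other examples in this unit; the type name point_type and the table points are invented, and the object-type syntax shown is the standard Oracle form:

import java.sql.Connection;
import java.sql.Statement;

public class ObjectTypeConstructor {
    public static void demo(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Declare an object type; its constructor is named after the type
            stmt.execute("CREATE TYPE point_type AS OBJECT (x NUMBER, y NUMBER)");
            stmt.execute("CREATE TABLE points (id NUMBER, p point_type)");

            // point_type(1, 2) is the constructor: the type name followed by a
            // parenthesized list of values forms a value of that type
            stmt.execute("INSERT INTO points VALUES (10, point_type(1, 2))");
        }
    }
}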
One of the most important parts of an Oracle database is its data dictionary: a read-only set of tables that provides information about its associated database. Closely related are the dynamic performance tables, which are not true tables and should not be accessed by most users. However, database administrators can query and create views on these tables and grant access to those views to other users. These views are sometimes called fixed views because they cannot be altered or removed by the database administrator.
1.2.18 The Nested Relational Data Model
The nested relational data model is a natural generalisation of the relational data model, but it often
leads to designs which hide the data structures needed to specify queries and updates in
the information system. The relational data model on the other hand exposes the
specifications of the data structures and permits the minimal specification of queries and
updates using SQL. However, there are deficiencies in relational systems, which lead to a
demand for object-oriented nested relational solutions. This paper argues that these
deficiencies are not inherent in the relational data model, but are deficiencies in the
implementations of relational database systems.
The paper first sketches how the nested-relational model is a natural extension of the
object-relational data model, then shows how the nested relational model, while sound, is
expensive to use. It then examines the object-oriented paradigm for software engineering,
and shows that it gives very little benefit in database applications. Rather, the relational
model as represented in conceptual modeling languages is argued to provide an ideal
view of the data. The ultimate thesis is that a better strategy is to employ a main-memory
relational database optimised for queries on complex objects, with a query interface
based on a conceptual model query language.
The object-relational data model leads to nested relations
The object-relational data model (Stonebraker, Brown and Moore 1999) arises out of the
realisation that the relational data model abstracts away from the value sets of attribute
functions. If we think in terms of tuple identifiers in relations (keys), then a relation is
simply a collection of attribute functions mapping the key into value sets.
The pure relational data model is based on set theory, and operates in terms of
projections, cartesian products and selection predicates. Cartesian product simply creates
new sets from existing sets, while projection requires the notion of identity, since the
projection operation can produce duplicates, which must be identified. Selection requires
the concept of a predicate, but the relational model abstracts away from the content of the
predicate, requiring only a function from a tuple of value sets into {true, false}. The
relational system requires only the ability to combine predicates using the propositional calculus.
Particular value sets have properties which are used in predicates and in other operations.
The only operator used in the pure relational model is identity. The presence of this
operator is guaranteed by the requirement that the value sets be sets, although in practice
some value sets do not for practical purposes support identity (e.g., real numbers represented as floating point).
This realisation that the relational data model abstracts away from the types of value sets
and from the operators which are available to types has allowed the design of database
systems where the value sets can be of any type. Besides integers, strings, reals, and
booleans, object-relational databases can support text, images, video, animation,
programs and many other types. Each type supports a set of operations and predicates
which can be integrated with the relational operations into practical solutions (each type
is an abstract data type).
If a value set can be of any type, why not a set of elements of some type? Why not a
tuple? If we allow sets and tuples, then why not sets of tuples? Sets of tuples are relations
and the corresponding abstract data type is the relational algebra. Thus the object-relational data model leads to the possibility of relation-valued attributes in relations.
Having relation-valued attributes in relations looks as if it might violate first normal
form. However, the outer relational operations can only result in tuples whose attribute
values are either copies of attribute values from the original relations or are functions of
those values, in the same way as if the value sets were integers, the results are either the
integers present in the original tables or functions like square root of those integers. In
other words, the outer relational system can only see inside a relation-valued attribute to
the extent that a function is supplied to do so. These functions are particular to the
schema of the relation-valued attribute, and have no knowledge of the outer schema.
Since the outer relational model and the abstract data type of a relation-valued attribute are the same abstract data type, it makes sense to introduce a relationship between the two. The standard relationships are unnest and nest. Unnest is an operator which modifies the scheme of the outer data model, replacing the relation-valued attribute function by a collection of attribute functions corresponding to the scheme of the inner relation. Nest is the inverse operator, which collects a set of tuples agreeing on the other attributes into a single tuple with a relation-valued attribute.
Having relation-valued attributes together with nest and unnest operations between the
outer and inner relational systems is called the nested relational data model. We see that
the nested relational data model is a natural extension of the object-relational data model.
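A rough, purely illustrative Java rendering of nest and unnest (it assumes Java 16+ records; the dept/car fields echo the example discussed later in this section):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NestUnnest {
    record Car(String make, int year) {}
    record FlatRow(int deptId, String make, int year) {}   // unnested (flat) form

    public static void main(String[] args) {
        List<FlatRow> flat = List.of(
                new FlatRow(1, "Falcon", 1998),
                new FlatRow(1, "Laser", 2001),
                new FlatRow(2, "Laser", 1999));

        // Nest: group the flat rows so each dept maps to a relation-valued attribute (a set of cars)
        Map<Integer, List<Car>> nested = flat.stream().collect(
                Collectors.groupingBy(FlatRow::deptId,
                        Collectors.mapping(r -> new Car(r.make(), r.year()),
                                Collectors.toList())));

        // Unnest: flatten the relation-valued attribute back into flat rows
        List<FlatRow> unnested = nested.entrySet().stream()
                .flatMap(e -> e.getValue().stream()
                        .map(c -> new FlatRow(e.getKey(), c.make(), c.year())))
                .collect(Collectors.toList());

        System.out.println(nested);
        System.out.println(unnested);
    }
}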
In recent years the object-oriented model has become the dominant programming model
and is becoming more common in systems design, including information systems. The
data in an object-oriented system consists typically of complex data structures built from
tuple and collector types. The tuple type is the same as the tuple type in the
object-relational model. A collector type is either a set, a list, or a multiset. The latter two can
be seen as sets with an additional attribute: a list is a set with a sequence attribute, while a
multiset is a set with an additional identifying attribute. So a nested-relational data model
can represent data from an object-oriented design. Accordingly, object-relational
databases with object-relational nested SQL can be used to implement object-oriented
databases. How this is done is described, for example, by Stonebraker, Brown and Moore
(1999) (henceforth SBM). We should note that both the relational and object-oriented
data models are implementations of more abstract conceptual data models expressed in
conceptual data modelling languages such as the Entity-Relationship-Attribute (ERA)
method. Well-established information systems design methods begin the analysis of data
with a conceptual model, moving to a particular database implementation at a later stage.
An example adapted from SBM will clarify some issues.
Figure 1. Since the relationship between department and vehicle is one-to-many,
associated with each department is a set of vehicles.
abstract data types supporting the value sets. In particular, extending and overloading the
dot notation for disambiguating attribute names support nested relational systems.
For example, in
Select ID from dept where car.year = 1999 (2)
The dot notation car.year identifies the year attribute of the car tuple, and also designates the membership
of a tuple where year = 1999 in the set of tuples which is the value set of dept.car. The
result of this query on the table of Figure 2 is ID = 1.
As a consequence of this overloading, the and boolean operator in the WHERE clause
becomes, if not ambiguous, at least counterintuitive to someone used to standard SQL.
Select ID from dept where car.year = 1999 and car.make = Laser (4)
The and operator is interpreted as set intersection, and the result is also ID = 1.
This result, although correct, is probably not what the maker of the query intended. They
would more likely have been looking for a department which has a 1999 Laser, and the
response they would be looking for would be none.
There are two ways to fix this problem. One is to import a new and operator from the
relational ADT, so that (4) becomes
Select ID from dept where car.year = 1999 and2 car.make = Laser (5)
In this solution, both arguments of and2 must be the same relation-valued attribute of the
outer system.
The other solution is to unnest the table so that the standard relational operator works in
the way it does in standard SQL
The same sort of problem occurs when we try to correlate the SELECT clause with the
WHERE clause
when applied to the table of Figure 2, as a consequence of first normal form. We need
again to use unnest to convert the nested structure to a flat relational structure in order to
make the query mean what we want to say. Although OR SQL is a sound and complete
query language, the simple-looking queries tend to be not very useful, and in order to
make useful queries additional syntax and a good understanding of the possibly complex
and possibly multiple nesting structure is essential. The author’s experience is that it is
very hard to teach, even to very advanced students.
There are several different ways to implement this application in the nested relational
model, taking each of the entities as the outermost relation. If implemented as a single
table, two of the entities would be stored redundantly because of the many-to-many
relationships. So the normalised way is to store the relationships as sets of reference types
(attributes whose value sets are object identifiers).
If the query follows the nesting structure used in the implementation, then we have only
the problems of correlation of various clauses in the SQL query described in the last
section.
However, if the query does not follow the nesting structure, it can get very complex. For
example, if the table has a set of courses associated with each student and a set of
lecturers associated with each course, then in order to find the students associated with a
given lecturer, the whole structure needs to be unnested, and done so across reference
types. The query is hard to specify, and would be very complex to implement.
One might argue that one should not use the nested relational model for many to many
relationships. But nested systems can interact, as in Figure 3.
In this case, an event has a set of races, and a team has a set of competitors, and we have
to decide whether a race has a set of references to competitor or vice versa. What if we
want to find what events a team participates in? The whole structure must be unnested.
The point is that representing these commonly occurring complex data structures using a
nested relational model is very much more complex than representing them in the
standard relational model.
Using the NR model forces the designer to make more choices at the database
schema level than if the standard relational model is used.
A query on a NR model must include navigation paths.
A query must often unnest complex structures, often very deeply for even
semantically simple queries.
So even though the nested relational model is sound, it is very much more difficult to use
than the standard relational model, so may be thought of as much more expensive to use.
In order for a more expensive tool to be a sound engineering choice, there must be a
corresponding benefit. Let us therefore look at the benefits of the object-oriented
programming model.
Let us see how this applies to the specification of data in an information system. As we
have seen, it is common to use a conceptual modelling technique to specify such data.
The implementation of this data is ultimately in terms of disk
addresses, file organisations and access methods, but is generally done in several stages.
Further stages of implementation are performed almost entirely within the database
manager software (DBMS), sometimes with the guidance of a database administrator
who will identify attributes of tables which need rapid access, or give the DBMS some
parameters which it will use to choose among pre-programmed design options. In effect,
the implementation of the data model is almost entirely automated, and generally not the
concern of the applications programmer.
So the conceptual data model is a specification, the almost equivalent DBMS table
schemas are in effect also specifications, and the programmer does not generally proceed
further with refinement.
The SQL statement is at a very high level, and is generally also refined in several stages.
But, again, these refinement decisions are made by the DBMS using pre-programmed
design decisions depending on statistics of the tables held in the system catalog and to a
degree on parameters supplied by the database administrator. The programmer is
generally not concerned with them.
So it makes sense to think of an SQL statement not as a program but as a specification for
a program. It is hard to see what might be removed from an SQL statement while
retaining the same specified result. The SELECT clause determines which columns are to
appear in the result, the FROM clause determines which tables to retrieve data from (in
effect which entities and relationships the data is to come from), and the WHERE clause
determines which rows to retrieve data from.
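A small illustration of this division of labour, using a hypothetical employee table (the table and column names are invented for the example):

Select name, salary    -- SELECT: which columns appear in the result
From employee          -- FROM: which tables (entities and relationships) supply the data
Where dept = 'Sales'   -- WHERE: which rows qualify

Nothing in the statement says how the qualifying rows are to be found; that refinement is left entirely to the DBMS.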
We have seen that the benefit of information hiding in object-oriented design is that the
programmer can work with the specifications of the data and methods of a system
without having to worry about how the specifications are implemented. However, in
information systems, the programmer works only with specifications of data structures
and access/update methods. The implementation is hidden already in the DBMS. So in a
DBMS environment the programmer never has to worry how the specifications are
implemented. Information hiding is already employed no matter what design method the
programmer uses.
What the nested relational data model does is hide aspects of the structure of the specified
data, whereas the standard relational model exposes the specified structure of the data.
Using the NR data model, the data designer must make what amount to packaging design
decisions in the implementation of a conceptual model. In this sense, a NR model is more
refined than a standard relational model, and is therefore more expensive to build. On the
other hand, when a query is planned, in the NR model the programmer, besides
specifying the data that is to appear in the query, must also specify how to unpackage the
data to expose sufficient structure to specify the result. So as we have seen, the query is
also more expensive. Both the data representation and the query are unnecessarily more
expensive than the standard relational representation, since the information being hidden
is part of the specification, not how the specifications are implemented.
One might ask why people don’t already use relational databases for problems calling for
object-oriented approaches. The usual reason given is that RDBs are too slow. The
paradigmatic object-oriented application is system design, say a VLSI design or the
design of a large software system. There is often only one (very complex) object in the
system. This object has many parts, which are themselves complex. A relational
implementation therefore calls for many subordinate tables with limited context; and
processing data in the application generally requires large numbers of joins.
Rejection of the standard relational data model for these applications is therefore not a
rejection of the model per se, but a recognition that current implementations of the
standard relational data model do not perform well enough for these problems.
Two problems have been identified which make the standard relational model difficult to
use for OO applications: the slowness of the implementation and the necessity for the
definition of a large number of tables with limited context.
The former problem is technical. A large amount of investment has been made in the
design of implementations for transaction-oriented applications. Given sufficient
effective demand, there is no reason why a sufficient investment cannot be made for
applications of the OO type. In particular, there are already relational database systems
optimised around storage of data primarily in main memory rather than on disk. For
example, a research project of the National Research Institute for Mathematics and Computer
Science in the Netherlands together with the Free University of Amsterdam, called
Monet, has published a number of papers on the various design issues in this area. A
search on the Web identifies many such products. The problem of slowness of standard
relational implementations for OO applications can be taken to be on the way to solution.
The latter problem, that the data definition for an OO application requires a large number
of tables with limited context, is a problem with the expressiveness of the standard
relational data model. In an OO application one frequently wants to navigate the complex
data structures specified. One might want the set of teams participating in a particular
race in a particular event, or the set of events in
which a particular competitor from a particular team is competing, or the association
between teams and events defined by the many-to-many relationship between Race and
Competitor. From the point of view of each of those queries, there is a nested-relational
packaging of the conceptual model which makes the query simple, simpler than the
standard relational representation. The unsuitability of the NR model is that these NR
packagings are all different, and that a query not following the chosen packaging
structure is very complex.
However, we have already seen that the primary representation of the data can be in a
conceptual model. The relational representation can be, and generally is, constructed
algorithmically. If the DBMS creates the relational representation of the conceptual
model, then the conceptual model should be the basis for the query language. A query
expressed on the conceptual model can be translated into SQL DML in the same sort of
way that the model itself is translated into SQL DDL. In fact, there are a number of
conceptual query languages which permit the programmer to construct a query by
specifying a navigation through the conceptual model, for example ConQuer (Bloesch
and Halpin, 1996, 1997).
Using a language like ConQuer, the programmer can specify a navigation path through
the conceptual model, which when it traverses a one-to-many relationship opens the set
of instances on the target side. When it traverses a many-to-many relationship, the view
from the source of the path looks like a one-to-many. Such a traversal of the conceptual
model provides a sort of virtual nested-relational data packaging, which can be translated
into standard SQL without the programmer being aware of exactly how the data is
packaged. This approach therefore is more true to the spirit of object-oriented software
development since the implementation of the specification is completely hidden.
1.2.14 Conclusion
The standard relational data model, where the DDL and DML are both hidden beneath a
conceptual data modelling language and the DBMS is a main-memory implementation
optimised for OO-style applications, presents a much superior approach to the problem of
OO applications than does the nested relational data model.
1.4. Intext Questions
1. Illustrate the ODMG standard.
2. What is the C++ language binding?
3. Explain the concept of object-oriented databases.
4. Define the Object Definition Language.
5. Write a note on the Object Query Language.
6. Discuss the usage of CORBA in database management.
7. Explain the Entity-Relationship diagram.
1.5. Summary
1. What is a database?
2. Define ODL and OQL.
3. What is polymorphism?
4. What do you mean by OOAD?
5. What is the main use of CORBA?
1.8 Assignments
1. Using C++, write the ODL statements needed to fetch data from the Inventory
database.
1.11 Keywords
1. Object-Oriented Database
2. ORDBMS – Object Relational Database Management System
3. ODMG – Object Data Management Group
4. ODL – Object Definition Language
5. OQL – Object Query Language
UNIT - II
Topics:
Functional Dependencies & Normalization For Relational Database
Normal Forms Based on Primary Keys
General Definitions of Second and Third Normal Forms
Boyce-Codd Normal Form
Algorithms for Relational Database Schema Design
Multivalued Dependencies and Fourth Normal Form
Join Dependencies and Fifth Normal Form
The Database Design Process
2.0 Introduction
In the early 1970s, E. F. Codd used relational mathematics to devise a system in which
tables can be designed in such a way that certain "anomalies" are eliminated by careful
selection of which columns (attributes) are included in each table. Since relational
mathematics is based upon "relations", it is assumed that all tables in this discussion
satisfy the assumptions incorporated in a relation, mentioned earlier. The widespread use
of the relational database model is a fairly recent phenomenon because the operation of
joining tables requires considerable computer resources and it is only in recent years that
computer hardware is such that large relational databases can be satisfactorily
maintained.
Suppose we are now given the task of designing and creating a database. Good database
design, needless to say, is important. Careless design can lead to uncontrolled data
redundancies that will lead to problems with data anomalies.
2.1 Objective
a. A Bad Design
E. F. Codd identified certain structural features in a relation which create retrieval and
update problems. Suppose we start off with a relation with a structure and details like:
Simple structure
This is a simple and straightforward design. It consists of one relation where we have a
single tuple for every customer and under that customer we keep all his transaction
records about parts, up to a possible maximum of 9 transactions. For every new
transaction, we need not repeat the customer details (of name, city and telephone), we
simply add on a transaction detail.
Let us try to construct a query to "Find which customer(s) bought P# 2". The query
would have to access every customer tuple and, for each tuple, examine every one of its
transactions looking for P# 2.
Alternatively, why don't we re-structure our relation such that we do not restrict the
number of transactions per customer. We can do this with the following structure:
This way, a customer can have just any number of Part transactions without worrying
about any upper limit or wasted space through null values (as it was with the previous
structure).
It seems a waste of storage to keep repeated values of Cname, Ccity and Cphone.
If C# 1 were to change his telephone number, we would have to ensure that we
update ALL occurrences of C# 1's Cphone values. This means updating tuple 1,
tuple 2 and all other tuples where there is an occurrence of C# 1. Otherwise, our
database would be left in an inconsistent state.
Suppose we now have a new customer with C# 4. However, there is no part
transaction yet with the customer as he has not ordered anything yet. We may find
that we cannot insert this new information because we do not have a P# which
serves as part of the 'primary key' of a tuple. Suppose the third transaction has
been canceled, i.e. we no longer need information about 25 of P# 1 being ordered
on 26 Jan. We thus delete the third tuple. We are then left with the following
relation:
But then, suppose we need information about the customer "Martin", say the city he is
located in. Unfortunately, information about Martin was held only in that tuple, so
deleting the entire tuple because of its P# transaction also means that we have lost
all information about Martin from the relation.
As illustrated in the above instances, we note that badly designed, unnormalised relations
waste storage space. Worse, they give rise to the following storage irregularities:
Update anomaly: Data inconsistency or loss of data integrity can arise from data
redundancy/repetition and partial update.
Insertion anomaly: Data cannot be added because some other data is absent.
Deletion anomaly: Data may be unintentionally lost through the deletion of other
data.
2.2 Content
Intuitively, it would seem that these undesirable features can be removed by breaking a
relation into other relations with desirable structures. We shall attempt by splitting the
above Transaction relation into the following two relations, Customer and Transaction,
which can be viewed as entities with a one to many relationship.
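A hedged DDL sketch of this split is shown below. The column names follow the text (C#, Cname, Ccity, Cphone, P#); the date and quantity columns, the data types, and the table name Trans (used because Transaction is a reserved word in some dialects) are assumptions made for illustration.

CREATE TABLE Customer (
    Cnum   INTEGER PRIMARY KEY,   -- C#
    Cname  VARCHAR(40),
    Ccity  VARCHAR(40),
    Cphone VARCHAR(20)
);

CREATE TABLE Trans (
    Cnum  INTEGER NOT NULL REFERENCES Customer(Cnum),   -- the ordering customer
    Pnum  INTEGER NOT NULL,                              -- P#
    Tdate DATE,                                          -- date of the order
    Qty   INTEGER,                                       -- quantity ordered
    PRIMARY KEY (Cnum, Pnum, Tdate)
);

With this one-to-many structure a customer's details are stored exactly once, and a customer who has not yet ordered anything can still be inserted.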
Let us see if this new design will alleviate the above storage anomalies:
a. Update anomaly
If C# 1 were to change his telephone number, as there is only one occurrence of the tuple
in the Customer relation, we need to update only that one tuple as there are no
redundant/duplicate tuples.
b. Addition anomaly
Adding a new customer with C# 4 can be easily done in the Customer relation of which
C# serves as the primary key. With no P# yet, a tuple in Transaction need not be created.
c. Deletion anomaly
Canceling the third transaction about 25 of P# 1 being ordered on 26 Jan would now
mean deleting only the third tuple of the new Transaction relation above. This leaves
information about Martin still intact in the new Customer relation.
Problems of this kind arise from:
Data aggregates
Partial key dependency
Indirect key dependency
The stages of normalisation that remove the associated problems are defined below.
We shall now show a more formal process on how we can decompose relations into
multiple relations by using the Normal Form rules for structuring.
According to (Elmasri & Navathe, 1994), the normalization process, as first proposed by
Codd (1972), takes a relation schema through a series of tests to "certify" whether or not
it belongs to a certain normal form. Initially, Codd proposed three normal forms, which
he called first, second, and third normal form. A stronger definition of 3NF was
proposed later by Boyce and Codd and is known as Boyce-Codd normal form (BCNF).
All these normal forms are based on the functional dependencies among the attributes of
a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were
proposed, based on the concepts of multivalued dependencies and join dependencies,
respectively.
Functional Dependencies
A formal framework for analyzing relation schemas based on their keys and on
the functional dependencies among their attributes.
A series of tests that can be carried out on individual relation schema so that the
relational database can be normalized to any degree. When a test fails, the
relation violating that test must be decomposed into relations that individually
meet the normalization tests.
Normal forms, when considered in isolation from other factors, do not guarantee a good
database design. It is generally not sufficient to check separately that each relation
schema in the database is, say, in BCNF or 3NF. Rather, the process of normalization
through decomposition must also confirm the existence of additional properties that the
relational schemas, taken together, should possess. Two of these properties are:
The lossless join or nonadditive join property, which guarantees that the spurious
tuple problem does not occur.
The dependency preservation property, which ensures that all functional
dependencies are represented in some of the individual resulting relations.
Let's begin by creating a sample set of data. Imagine we are working on a system to keep
track of employees working on certain projects.
A problem with the above data should immediately be obvious. Tables in relational
databases, which would include most databases you'll work with, are in a simple grid, or
table format. Here, each project has a set of employees. So we couldn't even enter the
data into this kind of table. And if we tried to use null fields to cater for the fields that
have no value, then we cannot use the project number, or any other field, as a primary
key (a primary key is a field, or list of fields, that uniquely identifies one record). There is
not much use in having a table if we can't uniquely identify each record in it.
So, our solution is to make sure that each field has no sets, or repeating groups.
Now we can place the data in a table.
employee_project table
Notice that the project number cannot be a primary key on it's own. It does not uniquely
identify a row of data. So, our primary key must be a combination of project number and
employee number. Together these two fields uniquely identify one row of data. (Think
about it. You would never add the same employee more than once to a project. If for
some reason this could occur, you'd need to add something else to the key to make it
unique.) So, now our data can go in table format, but there are still some problems with it.
We store the information that code 1023 refers to the Madagascar travel site 3 times!
Besides the waste of space, there is another serious problem. Look carefully at the data
below.
employee_project table

Project number  Project name            Employee number  Employee name     Rate category  Hourly rate
1023            Madagascar travel site  12               Pauline James     B              $50
1023            Madagascat travel site  16               Charles Ramoraz   C              $40
1056            Online estate agency    11               Vincent Radebe    A              $60
1056            Online estate agency    17               Monique Williams  B              $50
Did you notice anything strange in the data above? Congratulations if you did!
Madagascar is misspelt in one of the 1023 records. Now imagine trying to spot this error in a table
with thousands of records! By using the structure above, the chances of the data being
corrupted increases drastically.
The solution is simply to take out the duplication. What we are doing formally is looking
for partial dependencies, i.e. fields that are dependent on a part of a key, and not the entire
key. Since both project number and employee number make up the key, we look for
fields that are dependent only on project number, or on employee number.
employee_project table
Clearly we can't simply take out the data and leave it out of our database. We take it out,
and put it into a new table, consisting of the field that has the partial dependency, and the
field it is dependent on. So, we identified employee name, hourly rate and rate category
as being dependent on employee number.
The new table will consist of employee number as a key, and employee name, rate
category and hourly rate, as follows:
Employee table
Employee number Employee name Rate category Hourly rate
11 Vincent Radebe A $60
12 Pauline James B $50
16 Charles Ramoraz C $40
17 Monique Williams B $50
Project table
Project number Project name
1023 Madagascar travel site
1056 Online estate agency
Note the reduction of duplication. The text "Madagascar travel site" is stored once only,
not for each occurrence of an employee working on that project. The link is made through
the key, the project number. Obviously there is no way to remove the duplication of this
number without losing the relation altogether, but it is far more efficient to store a short
number repeatedly than a large piece of text.
We're still not perfect. There is still room for anomalies in the data. Look carefully at the
data below.
Employee table
The problem above is that Monique Williams has been awarded an hourly rate of $40,
when she is actually category B, and should be earning $50 (In the case of this company,
the rate category - hourly rate relationship is fixed. This may not always be the case).
Once again we are storing data redundantly: the hourly rate - rate category relationship is
being stored in its entirety for each employee. The solution, as before, is to remove this
excess data into its own table. Formally, what we are doing is looking for transitive
relationships, or relationships where a non-key attribute is dependent on another non-key
attribute. Hourly rate, while being in one sense dependent on Employee number (we
probably identified this dependency earlier, when looking for partial dependencies) is
actually dependent on Rate category. So, we remove it, and place it in a new table, with
its actual key, as follows.
Employee table
Rate table
We've cut down once again. It is now impossible to mistakenly assume rate category "B"
is associated with an hourly rate of anything but $50. These relationships are stored in
only one place - our new table, where it can be ensured they are accurate.
a. Modification Anomalies
Once our E-R model has been converted into relations, we may find that some
relations are not properly specified. There can be a number of problems:
o Deletion Anomaly: Deleting a relation results in some related information
(from another entity) being lost.
o Insertion Anomaly: Inserting a relation requires we have information
from two or more entities - this situation might not be feasible.
Here is a quick example: A company has a Purchase order form:
The normal forms defined in relational database theory represent guidelines for record
design. The guidelines corresponding to first through fifth normal forms are presented
here, in terms that do not require an understanding of relational theory. The design
guidelines are meaningful even if one is not using a relational database system. We
present the guidelines without referring to the concepts of the relational model in order to
emphasize their generality, and also to make them easier to understand. Our presentation
conveys an intuitive sense of the intended constraints on record design, although in its
informality it may be imprecise in some technical details. A comprehensive treatment of
the subject is provided by Date.
The normalization rules are designed to prevent update anomalies and data
inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward
the assumption that all non-key fields will be updated frequently. They tend to penalize
retrieval, since data which may have been retrievable from one record in an unnormalized
design may have to be retrieved from several records in the normalized form. There is no
obligation to fully normalize all records when actual performance requirements are taken
into account.
First normal form is now considered to be part of the formal definition of a relation;
historically, it was defined to disallow multivalued attributes, composite attributes, and
their combinations. It states that the domains of attributes must include only atomic
(simple, indivisible) values and that the value of any attribute in a tuple must be a single
value from the domain of that attribute.
Practical Rule: "Eliminate Repeating Groups," i.e., make a separate table for each set of
related attributes, and give each table a primary key.
Formal Definition: A relation is in first normal form (1NF) if and only if all underlying
simple domains contain atomic values only.
Under first normal form, all occurrences of a record type must contain the same number
of fields.
First normal form excludes variable repeating fields and groups. This is not so much a
design guideline as a matter of definition. Relational database theory doesn't deal with
records having a variable number of fields.
Example 1:
Let's run again through the example we've just done, this time without the data tables to
guide us. After all, when you're designing a system, you usually won't have test data
available at this stage. The tables were there to show you the consequences of storing
data in unnormalized tables, but without them we can focus on dependency issues, which
is the key to database normalization.
Project number
Project name
1-n Employee numbers (1-n indicates that there are many occurrences of this field - it is a
repeating group)
1-n Employee names
1-n Rate categories
1-n Hourly rates
So, to begin the normalization process, we start by moving from zero normal form to 1st
normal form.
So far, we have no keys, and there are repeating groups. So we remove the repeating
groups, and define the primary key, and are left with the following:
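As a hedged sketch, the resulting 1NF structure corresponds to the employee_project layout used in the earlier tables (the data types are assumptions for illustration):

CREATE TABLE employee_project (
    project_number  INTEGER,
    project_name    VARCHAR(60),
    employee_number INTEGER,
    employee_name   VARCHAR(60),
    rate_category   CHAR(1),
    hourly_rate     DECIMAL(8,2),
    PRIMARY KEY (project_number, employee_number)   -- composite key; no repeating groups remain
);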
Example 2:
To conclude, a relation is in first normal form if it meets the definition of a relation.
We deal now only with "single-valued" facts. The fact could be a one-to-many
relationship, such as the department of an employee, or a one-to-one relationship, such as
the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one
or one-to-many relationship between Y and X. In the general case, Y might consist of one
or more fields, and so might X. In the following example, QUANTITY is a fact about the
combination of PART and WAREHOUSE.
Practical Rule: "Eliminate Redundant Data," i.e., if an attribute depends on only part of a
multivalued key, remove it to a separate table.
Formal Definition: A relation is in second normal form (2NF) if and only if it obeys the
conditions of First Normal Form and every nonkey attribute is fully dependent on the
primary key.
Example 1:
So, we go through all the fields. Considering our example, Project name is only
dependent on Project number. Employee name, Rate category and Hourly rate are
dependent only on Employee number. So we remove them, and place these fields in a
separate table, with the key being that part of the original key they are dependent on. So,
we are left with the following 3 tables:
Employee table
Project table
Example 2:
As we know that second normal form is violated when a non-key field is a fact about a
subset of a key. It is only relevant when the key is composite, i.e., consists of several
fields. Consider the following inventory record:
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but
WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems
with this design are:
The warehouse address is repeated in every record that refers to a part stored in
that warehouse.
If the address of the warehouse changes, every record referring to a part stored in
that warehouse must be updated.
Because of the redundancy, the data might become inconsistent, with different
records showing different addresses for the same warehouse.
If at some point in time there are no parts stored in the warehouse, there may be
no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed into
(replaced by) the two records:
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============---------------------
When a data design is changed in this way, replacing unnormalized records with
normalized records, the process is referred to as normalization. The term "normalization"
is sometimes used relative to a particular normal form. Thus a set of records may be
normalized with respect to second normal form but not with respect to third.
The normalized design enhances the integrity of the data, by minimizing redundancy and
inconsistency, but at some possible performance cost for certain retrieval applications.
Consider an application that wants the addresses of all warehouses stocking a certain part.
In the unnormalized form, the application searches one record type. With the normalized
design, the application has to search two record types, and connect the appropriate pairs.
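For instance, the retrieval just described becomes a two-table join in the normalized design. In the sketch below the table names stock and warehouse, the underscored column names, and the part number 'P123' are assumptions made for illustration:

-- addresses of all warehouses stocking part 'P123'
Select w.warehouse, w.warehouse_address
From stock s
Join warehouse w On w.warehouse = s.warehouse
Where s.part = 'P123'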
To summarize,
A relation is in second normal form (2NF) if all of its non-key attributes are
dependent on all of the key.
Relations that have a single attribute for a key are automatically in 2NF.
This is one reason why we often use artificial identifiers as keys.
In the example below, Close_Price is dependent on the combination Company, Date and
also on the combination Symbol, Date.
The following example relation is not in 2NF:
STOCKS (Company, Symbol, Headquarters, Date, Close_Price)
Practical Rule: "Eliminate Columns not Dependent on Key," i.e., if attributes do not
contribute to a description of a key, remove them to a separate table.
Formal Definition: A relation is in third normal form (3NF) if and only if it is in 2NF
and every nonkey attribute is nontransitively dependent on the primary key.
Example 1:
We can narrow our search down to the Employee table, which is the only one with more
than one non-key attribute. Employee name is not dependent on either Rate category or
Hourly rate, the same applies to Rate category, but Hourly rate is dependent on Rate
category. So, as before, we remove it, placing it in its own table, with the attribute it was
dependent on as key, as follows:
Employee table
Rate table
Project table
Project number - primary key
Project name
These tables are all now in 3rd normal form, and ready to be implemented.
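A hedged DDL sketch of these 3NF tables follows; the data types, and the employee_project association table needed to record who works on which project, are assumptions made for illustration.

CREATE TABLE project (
    project_number INTEGER PRIMARY KEY,
    project_name   VARCHAR(60)
);

CREATE TABLE rate (
    rate_category CHAR(1) PRIMARY KEY,
    hourly_rate   DECIMAL(8,2)
);

CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    employee_name   VARCHAR(60),
    rate_category   CHAR(1) REFERENCES rate(rate_category)
);

CREATE TABLE employee_project (
    project_number  INTEGER REFERENCES project(project_number),
    employee_number INTEGER REFERENCES employee(employee_number),
    PRIMARY KEY (project_number, employee_number)
);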
Example 2:
Third normal form is violated when a non-key field is a fact about another non-key field,
as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the
LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about
the EMPLOYEE. The problems with this design are the same as those caused by
violations of second normal form:
To satisfy third normal form, the record shown above should be decomposed into the two
records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
Example: At CUNY:
Course_Num, Section -> Classroom, Professor
Example: At Rutgers:
Example:
Company County
SONY Putnam
AT&T Ritchie
Before you rush off and start normalizing everything, a word of warning. No process is
better than good old common sense. Take a look at this example.
Customer table
What normal form is this table in? Giving it a quick glance, we see no repeating groups,
and a primary key defined, so it's at least in 1st normal form. There's only one key, so we
needn't even look for partial dependencies, so it's at least in 2nd normal form. How about
transitive dependencies? Well, it looks like Town might be determined by Zip Code. And
in most parts of the world that's usually the case. So we should remove Town, and place
it in a separate table, with Zip Code as the key? No! Although this table is not technically
in 3rd normal form, removing this information is not worth it.
Creating more tables increases the load slightly, slowing processing down. This is often
counteracted by the reduction in table sizes, and redundant data. But in this case, where
the town would almost always be referenced as part of the address, it isn't worth it.
Perhaps a company that uses the data to produce regular mailing lists of thousands of
customers should normalize fully. It always comes down to how the data is going to be
used. Normalization is just a helpful process that usually results in the most efficient table
structure, and not a rule for database design. But judging from some of the table
structures I've seen out there, it's better to err and normalize than err and not!
SKU -> Compact_Disk_Title, Artist
Model, Options, Tax -> Car_Price
Course_Number, Section -> Professor, Classroom, Number of Students
The attributes listed on the left hand side of the -> are called determinants.
One can read A -> B as, "A determines B".
Key: One or more attributes that uniquely identify a tuple (row) in a relation.
Page 66
Advanced RDBMS
The selection of keys will depend on the particular application being considered.
Users can offer some guidance as to what would make an appropriate key. Also
this is pretty much an art as opposed to an exact science.
Recall that no two tuples in a relation should have exactly the same values; thus, in the
worst case, a candidate key would consist of all of the attributes in a relation.
A key functionally determines a tuple (row).
In relational database theory, second and third normal forms are defined in terms of
functional dependencies, which correspond approximately to our single-valued facts. A
field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two
records with the same X-value but different Y-values. That is, a given X-value must
always occur with the same Y-value. When X is a key, then all fields are by definition
functionally dependent on X in a trivial way, since there can't be two records having the
same X value.
---------------------------------------------
| PERSON     | ADDRESS                       |
---------------------------------------------
| John Smith | 123 Main St., New York        |
| John Smith | 321 Center St., San Francisco |
---------------------------------------------
Although each person has a unique address, a given name can appear with several
different addresses. Hence we do not have a functional dependency corresponding to our
single-valued fact.
Similarly, the address has to be spelled identically in each occurrence in order to have a
functional dependency. In the following case the same person appears to be living at two
different addresses, again precluding a functional dependency.
---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------
For instance, we as designers know that in the following example there is a single-valued
fact about a non-key field, and hence the design is susceptible to all the update anomalies
mentioned earlier.
----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------
Boyce-Codd normal form is stricter than 3NF, meaning that every relation in BCNF is
also in 3NF; however, a relation in 3NF is not necessarily in BCNF. A relation schema is
in BCNF if, whenever a functional dependency X -> A holds in the relation, then X is a
superkey of the relation. The only difference between BCNF and 3NF is that condition
(b) of 3NF, which allows A to be prime if X is not a superkey, is absent from BCNF.
Formal Definition: A relation is in Boyce/Codd normal form (BCNF) if and only if every
determinant is a candidate key. [A determinant is any attribute on which some other
attribute is (fully) functionally dependent.]
Steps in analyzing for BCNF:
(1) Find and list all the candidate keys. (Usually the primary key is known.)
(2) Determine and list all functional dependencies, noting those which are
dependent on attributes which are not the entire primary key.
(3) Determine if any dependencies exist which are based on part but not all of a
candidate key.
(4) Project into relations which remove the problems found in (3).
To summarize,
In this case, the combination FundID and InvestmentType form a candidate key
because we can use FundID,InvestmentType to uniquely identify a tuple in the
relation.
Similarly, the combination FundID and Manager also form a candidate key because
we can use FundID, Manager to uniquely identify a tuple.
Manager by itself is not a candidate key because we cannot use Manager alone to
uniquely identify a tuple in the relation.
Is this relation R(FundID, InvestmentType, Manager) in 1NF, 2NF or 3NF ?
Given we pick FundID, InvestmentType as the Primary Key: 1NF for sure.
2NF because the only non-key attribute (Manager) is dependent on all of the key.
3NF because there are no transitive dependencies.
Consider what happens if we delete the tuple with FundID 22. We lose the fact that
Brown manages the InvestmentType "Common Stocks."
Rnew(Manager, InvestmentType)
Rorig(FundID, Manager)
In this last step, we have retained the determinant "Manager" in the original relation
Rorig.
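A hedged sketch of this decomposition as table definitions, assuming the dependency Manager -> InvestmentType that drives it (the data types are assumptions; the keys follow the candidate keys identified above):

CREATE TABLE Rnew (
    Manager        VARCHAR(40) PRIMARY KEY,   -- assumes Manager -> InvestmentType
    InvestmentType VARCHAR(40)
);

CREATE TABLE Rorig (
    FundID  INTEGER,
    Manager VARCHAR(40) REFERENCES Rnew(Manager),
    PRIMARY KEY (FundID, Manager)
);

Because the investment type managed by Brown now lives in Rnew, deleting FundID 22 from Rorig no longer loses the fact that Brown manages Common Stocks.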
Fourth and fifth normal forms deal with multi-valued facts. The multi-valued fact may
correspond to a many-to-many relationship, as with employees and skills, or to a many-
to-one relationship, as with the children of an employee (assuming only one parent is an
employee). By "many-to-many" we mean that an employee may have several skills, and a
skill may belong to several employees.
In a sense, fourth and fifth normal forms are also about composite keys. These normal
forms attempt to minimize the number of fields involved in a composite key, as
suggested by the examples to follow.
Practical Rule: "Isolate Independent Multiple Relationships," i.e., no table may contain
two or more 1:n or n:m relationships that are not directly related.
Formal Definition: A relation R is in fourth normal form (4NF) if and only if, whenever
there exists a multivalued dependency in R, say A ->> B, then all attributes of R are
also functionally dependent on A.
Under fourth normal form, a record type should not contain two or more independent
multi-valued facts about an entity. In addition, the record must satisfy third normal form.
Consider employees, skills, and languages, where an employee may have several skills
and several languages. We have here two many-to-many relationships, one between
employees and skills, and one between employees and languages. Under fourth normal
form, these two relationships should not be represented in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in two record types:
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
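As a hedged sketch, the two record types above correspond to two separate tables, each keyed on the full combination of its fields (the data types are assumptions, and the column language may need quoting or renaming in dialects where it is a keyword):

CREATE TABLE employee_skill (
    employee VARCHAR(40),
    skill    VARCHAR(40),
    PRIMARY KEY (employee, skill)
);

CREATE TABLE employee_language (
    employee VARCHAR(40),
    language VARCHAR(40),
    PRIMARY KEY (employee, language)
);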
Note that other fields, not involving multi-valued facts, are permitted to occur in the
record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE
example.
The main problem with violating fourth normal form is that it leads to uncertainties in the
maintenance policies. Several policies are possible for maintaining two independent
multi-valued facts in one record:
(1) A disjoint format, in which a record contains either a skill or a language, but not both:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
This is not much different from maintaining two separate record types. (We note in
passing that such a format also leads to ambiguities regarding the meanings of blank
fields. A blank SKILL could mean the person has no skill, or the field is not applicable to
this employee, or the data is unknown, or, as in this case, the data may be found in
another record.)
(2) A random mix, with three variations:
(a) Minimal number of records, with repetitions:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
(b) Minimal number of records, with null values:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
(c) Unrestricted:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
(3) A "cross-product" form, where for each employee, there must be a record for every
possible pairing of one of his skills with one of his languages:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Other problems caused by violating fourth normal form are similar in spirit to those
mentioned earlier for violations of second or third normal form. They take different
variations depending on the chosen maintenance policy:
If there are repetitions, then updates have to be done in multiple records, and they
could become inconsistent.
Insertion of a new skill may involve looking for a record with a blank skill, or
inserting a new record with a possibly blank language, or inserting multiple
records pairing the new skill with some or all of the languages.
Deletion of a skill may involve blanking out the skill field in one or more records
(perhaps with a check that this doesn't leave two records with the same language
and a blank skill), or deleting one or more records, coupled with a check that the
last mention of some language hasn't also been deleted.
a. Independence
In the example above, there is no direct connection between skills and languages. There is only an indirect connection because they belong to some
common employee. That is, it does not matter which skill is paired with which language
in a record; the pairing does not convey any information. That's precisely why all the
maintenance policies mentioned earlier can be allowed.
In contrast, suppose that an employee could only exercise certain skills in certain
languages. Perhaps Smith can cook French cuisine only, but can type in French, German,
and Greek. Then the pairings of skills and languages becomes meaningful, and there is no
longer an ambiguity of maintenance policies. In the present case, only the following form
is correct:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
b. Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit
further, we mention that fourth normal form is defined in terms of multivalued
dependencies, which correspond to our independent multi-valued facts. Multivalued
dependencies, in turn, are defined essentially as relationships which accept the "cross-
product" maintenance policy mentioned above. That is, for our example, every one of an
employee's skills must appear paired with every one of his languages. It may or may not
be obvious to the reader that this is equivalent to our notion of independence: since every
possible pairing must be present, there is no "information" in the pairings. Such pairings
convey information only if some of them can be absent, that is, only if it is possible that
some employee cannot perform some skill in some language. If all pairings are always
present, then the relationships are really independent.
We should also point out that multivalued dependencies and fourth normal form apply as
well to relationships involving more than two fields. For example, suppose we extend the
earlier example to include projects, in the following sense:
If there is no direct connection between the skills and languages that an employee uses on
a project, then we could treat this as two independent many-to-many relationships of the
form EP:S and EP:L, where "EP" represents a combination of an employee with a
project. A record including employee, project, skill, and language would violate fourth
normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy
fourth normal form.
To summarize,
Book example:
Student has one or more majors.
Student participates in one or more activities.
Portfolio ID  Stock Fund           Bond Fund
999           Janus Fund           Municipal Bonds
999           Janus Fund           Dreyfus Short-Intermediate Municipal Bond Fund
999           Scudder Global Fund  Municipal Bonds
999           Scudder Global Fund  Dreyfus Short-Intermediate Municipal Bond Fund
888           Kaufmann Fund        T. Rowe Price Emerging Markets Bond Fund
A few characteristics:
Stock Fund and Bond Fund form a multivalued dependency on Portfolio ID.
In some cases there may be no lossless join decomposition into two relation schemas but
there may be a lossless join decomposition into more than two relation schemas. These
cases are handled by the join dependency and fifth normal form, and it's important to
note that these cases occur very rarely and are difficult to detect in practice.
Practical Rule: "Isolate Semantically Related Multiple Relationships," i.e., there may be
practical constraints on information that justify separating logically related many-to-
many relationships.
Formal Definition: A relation schema R is in fifth normal form (5NF), also called project-join
normal form (PJNF), with respect to a set F of functional, multivalued, and join dependencies
if, for every nontrivial join dependency JD(R1, R2, …, Rn) in F+ (that is, implied by F), every
Ri is a superkey of R.
Fifth normal form deals with cases where information can be reconstructed from smaller
pieces of information that can be maintained with less redundancy. Second, third, and
fourth normal forms also serve this purpose, but fifth normal form generalizes to cases
not covered by the others.
We will not attempt a comprehensive exposition of fifth normal form, but illustrate the
central concept with a commonly used example, namely one involving agents,
companies, and products. If agents represent companies, companies make products, and
agents sell products, then we might want to keep a record of which agent sells which
product for which company. This information could be kept in one record type with three
fields:
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
This form is necessary in the general case. For example, although agent Smith sells cars
made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we
need the combination of three fields to know which combinations are valid and which are
not. But suppose that a certain rule was in effect: if an agent sells a certain product, and
he represents a company making that product, then he sells that product for that company.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------
In this case, it turns out that we can reconstruct all the true facts from a normalized form
consisting of three separate record types, each containing two fields:
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
------------------- --------------------- -------------------
These three record types are in fifth normal form, whereas the corresponding three-field
record shown previously is not.
Roughly speaking, we may say that a record type is in fifth normal form when its
information content cannot be reconstructed from several smaller record types, i.e., from
record types each having fewer fields than the original record. The case where all the
smaller records have the same key is excluded. If a record type can only be decomposed
into smaller records which all have the same key, then the record type is considered to be
in fifth normal form without decomposition. A record type in fifth normal form is also in
fourth, third, second, and first normal forms.
Fifth normal form does not differ from fourth normal form unless there exists a
symmetric constraint such as the rule about agents, companies, and products. In the
absence of such a constraint, a record type in fourth normal form is always in fifth normal
form.
One advantage of fifth normal form is that certain redundancies can be eliminated. In the
normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized
form it may be repeated many times.
It should be observed that although the normalized form involves more record types,
there may be fewer total record occurrences. This is not apparent when there are only a
few facts to record, as in the example shown above. The advantage is realized as more
facts are recorded, since the size of the normalized files increases in an additive fashion,
while the size of the unnormalized file increases in a multiplicative fashion. For example,
if we add a new agent who sells x products for y companies, where each of these
companies makes each of these products, we have to add x+y new records to the
normalized form, but xy new records to the unnormalized form.
It should be noted that all three record types are required in the normalized form in order
to reconstruct the same information. From the first two record types shown above we
learn that Jones represents Ford and that Ford makes trucks. But we can't determine
whether Jones sells Ford trucks until we look at the third record type to determine
whether Jones sells trucks at all.
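Put as a query, reconstructing the original three-field facts requires joining all three record types. In the sketch below the table names agent_company, company_product and agent_product are invented for the example:

Select ac.agent, cp.company, cp.product
From agent_company ac
Join company_product cp On cp.company = ac.company
Join agent_product ap On ap.agent = ac.agent
                     And ap.product = cp.product
-- an (agent, company, product) combination appears only when all three pairwise facts hold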
The following example illustrates a case in which the rule about agents, companies, and
products is satisfied, and which clearly requires all three record types in the normalized
form. Any two of the record types taken alone will imply something untrue.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Toyota | car |
| Brown | Toyota | bus |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------
Observe that:
Jones sells cars and GM makes cars, but Jones does not represent GM.
Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
Brown represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms both deal with combinations of multivalued facts. One
difference is that the facts dealt with under fifth normal form are not independent, in the
sense discussed earlier. Another difference is that, although fourth normal form can deal
with more than two multivalued facts, it only recognizes them in pairwise groups. We can
best explain this in terms of the normalization process implied by fourth normal form.
If a record violates fourth normal form, the associated normalization process decomposes
it into two records, each containing fewer fields than the original record. Any of these
still violating fourth normal form is again decomposed into two records, and so on until the
resulting records are all in fourth normal form. At each stage, the set of records after
decomposition contains exactly the same information as the set of records before
decomposition.
To summarize,
There are certain conditions under which after decomposing a relation, it cannot
be reassembled back into its original form.
We don't consider these issues here.
We can also always define stricter forms that take into account additional types of
dependencies and constraints. The idea behind domain-key normal form is to specify,
(theoretically, at least) the "ultimate normal form" that takes into account all possible
dependencies and constraints. A relation is said to be in DKNF if all constraints and
dependencies that should hold on the relation can be enforced simply by enforcing the
domain constraints and the key constraints specified on the relation.
Constraint: A rule governing static values of an attribute such that we can determine whether
this constraint is True or False. Examples:
1. Functional Dependencies
2. Multivalued Dependencies
3. Inter-relation rules
4. Intra-relation rules
Domain: The physical (data type, size, NULL values) and semantic (logical)
description of what values an attribute can hold.
There is no known algorithm for converting a relation directly into DK/NF.
Unavoidable Redundancies
a. Inter-Record Redundancy
The normal forms discussed here deal only with redundancies occurring within a single
record type. Fifth normal form is considered to be the "ultimate" normal form with
respect to such redundancies.
Other redundancies can occur across multiple record types. For the example concerning
employees, departments, and locations, the following records are in third normal form in
spite of the obvious redundancy:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------
In fact, two copies of the same record type would constitute the ultimate in this kind of
undetected redundancy.
Inter-record redundancy has been recognized for some time, and has recently been
addressed in terms of normal forms and normalization.
While we have tried to present the normal forms in a simple and understandable way, we
are by no means suggesting that the data design process is correspondingly simple. The
design process involves many complexities which are quite beyond the scope of this
paper. In the first place, an initial set of data elements and records has to be developed, as
candidates for normalization.
Information systems help management organize their schedules and plan for the
development and growth of the organization. Modern technologies go a step further and
deliver this information to the mobile phones and palmtops of top executives, so that they
have the needed information at hand.
Many of you asked for a "complete" example that would run through all of the normal
forms from beginning to end using the same tables. This is tough to do, but here is an
attempt:
Example relation:
EMPLOYEE (Name, Project, Task, Office, Floor, Phone)
Example Data:
Name  Project  Task  Office  Floor  Phone
Bill  100X     T1    400     4      1400
Bill  100X     T2    400     4      1400
Bill  200Y     T1    400     4      1400
Bill  200Y     T2    400     4      1400
Sue   100X     T33   442     4      1442
Sue   200Y     T33   442     4      1442
Sue   300Z     T33   442     4      1442
Ed    100X     T2    588     5      1588
Name  Project  Task
Bill  100X     T1
Bill  100X     T2
Bill  200Y     T1
Bill  200Y     T2
Sue   100X     T33
Sue   200Y     T33
Sue   300Z     T33
Ed    100X     T2

Name  Office  Floor
Bill  400     4
Sue   442     4
Ed    588     5

Office  Phone
400     1400
442     1442
588     1588

Name  Project
Bill  100X
Bill  200Y
Sue   100X
Sue   200Y
Sue   300Z
Ed    100X

Name  Task
Bill  T1
Bill  T2
Sue   T33
Ed    T2

Name  Office  Floor
Bill  400     4
Sue   442     4
Ed    588     5

R4 (Office, Phone)
Office  Phone
400     1400
442     1442
588     1588
Relation Name
CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)
Example Data
f. Functional Dependencies
g. Normalization
Check both CUSTOMER and ZIPCODE to ensure they are both in 1NF up to
BCNF.
Databases need to be tuned and pruned in order to provide up-to-date information, so it is
vital to manage the database.
If you want to get the maximum performance from your applications you need to tune
your SQL statements. Tuning of SQL statements means discovering the execution plan
that Oracle is using. Once the execution plan is known one can attempt to improve it.
Query performance can be improved in many ways: by creating indexes, by increasing the
size of the buffer cache, and by using optimizer hints. Hints are instructions to the Oracle
optimizer that are embedded within your statement, and they can be used to control
virtually any aspect of statement execution.
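For example, an index hint can be embedded in a comment immediately after the SELECT keyword. The table and index names below (EMP and EMP_DNO_IDX) are purely illustrative:
SELECT /*+ INDEX(e emp_dno_idx) */ e.ename, e.sal
FROM emp e
WHERE e.deptno = 5;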
Two important points when tuning rollback segments are Detecting contention and
reducing shrinkage.
Contention occurs when there are too few rollback segments in your database for the
amount of updates that are occurring. Shrinkage occurs when an optimal size has been
defined for a rollback segment and the rollback segment then grows beyond that size and is
forced to shrink back again.
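One common, if rough, way to check for rollback segment contention is to look at the undo-related wait classes in the V$WAITSTAT view; non-zero counts relative to the total number of data requests suggest adding rollback segments. This is only a sketch of the usual approach:
SELECT class, count
FROM v$waitstat
WHERE class LIKE '%undo%';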
Normalization is carried out in practice so that the resulting designs are of high quality
and meet the desirable properties.
The practical utility of these normal forms becomes questionable when the constraints on
which they are based are hard to understand or to detect.
Database designers need not normalize to the highest possible normal form (usually up to
3NF, BCNF or 4NF is sufficient).
Denormalization: the process of storing the join of higher normal form relations as a
base relation, which is in a lower normal form.
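As a simple illustration of denormalization, the join of the EMPLOYEE and DEPARTMENT relations of the COMPANY schema used elsewhere in this material could be stored as a base relation (the exact column list here is only an example):
CREATE TABLE emp_dept AS
SELECT e.ssn, e.fname, e.lname, e.dno, d.dname
FROM employee e, department d
WHERE d.dnumber = e.dno;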
2.5. Summary
A relation is in first normal form (1NF) if and only if all underlying domains contain
atomic values only.
A relation is in second normal form (2NF) if and only if it is in 1NF and every
non key attribute is fully dependent on the primary key.
A relation is in third normal form (3NF) if and only if it is in 2NF and every non
key attribute is non transitively dependent on the primary key.
A relation is in Boyce/Codd normal form (BCNF) if and only if every determinant
is a candidate key. [A determinant is any attribute on which some other attribute
is (fully) functionally dependent.]
A relation R is in fourth normal form (4NF) if and only if, whenever there exists a
multivalued dependency in R, say A ->> B, then all attributes of R are also
functionally dependent on A.
2.8 Assignment
Prepare an assignment about Oracle 8i.
2.11 Keywords
1. Normalization
2. Boyce-Codd Normal form
3. 1NF – First Normal Form
4. 2NF – Second Normal Form
5. Multi-Valued Dependency
6. Functional Dependency
UNIT - III
Topics:
Database System Architecture and The System Catalog
System Catalog Information
Data Dictionary and Data Repository Systems
Query Processing and Optimization: Translating SQL Queries
into Relational Algebra
Basic Algorithms for Executing Query Operations
Using Heuristics In Query Optimization
Query Optimization in Oracle
Transaction Processing Concepts
3.0. Introduction
The database system architecture and the system catalog form a basis for understanding
the basic structure and functions of a database management system. The many varieties of
database management software, together with Object Linking and Embedding (OLE)
objects and broker architectures, need to be understood by database practitioners in order
to implement systems effectively.
3.1. Objective
The objective of this unit is to understand the database system architecture and how
information is accessed by DBMS software modules such as the data dictionary and data
repository systems. Query processing and optimization is an area of interest, covering the
basic algorithms for executing query operations and the use of heuristics in query
optimization. Transaction processing concepts are explained in terms of transaction and
system concepts, which include schedules and recoverability.
3.2 Content
3.2.1 Database System Architectures and the System catalog: System Architectures
for DBMS
Data Model: A set of concepts to describe the structure of a database, and certain
constraints that the database should obey.
Data Model Operations: Operations for specifying database retrievals and updates by
referring to the concepts of the data model. Operations on the data model may include
basic operations and user-defined operations.
a. Integrated data.
Integrated data means that the database may be thought of as a unification of several
otherwise distinct data files, with any redundancy among those files either wholly or
partly eliminated.
Consequences of integration are sharing and the idea that any given user will normally be
concerned with only a subset of the total database; moreover, different users' subsets will
overlap in many different ways, i.e. a given database will be perceived by different users
in different ways. Also, users may be restricted to certain subsets of data.
b. Definition of Entity.
An entity is any distinguishable real world object that is to be represented in the database;
each entity will have attributes or properties, e.g. the entity lecture has the properties
place and time. A set of similar entities is known as an entity type.
c. Network Databases
A network database consists of two data sets, a set of records and a set of links, where the
record types are made up of fields in the usual way.
Networks are complicated data structures. Operators on network databases are complex,
functioning on individual records, and not sets of records. Increased complexity does not
mean increased functionality and the network model is no more powerful than the
relational model. However, a network-based DBMS can provide good performance
because its lack of abstraction means it is closer to the storage structures used, though this
is at the expense of ease of user programming. The network model also incorporates
certain integrity rules.
d. System Tables
Information about the database is maintained in the system catalogs. These vary from
system to system because the contents of the system catalog are specific to a particular
system. The INFORMIX system, for example, maintains a set of such tables in its system catalog.
System Catalogs
Every DBMS requires information by which it can estimate the cost of the various possible
plans that may be used to execute a query, so as to choose the best plan. For this it
maintains histograms, also known as catalogs. The catalogs used by Postgres are a
combination of equidepth and end-biased histograms, which leads to accurate estimates
both for frequently occurring values and for the range distribution of data values.
The system catalogs are the place where a relational database management system stores
schema metadata, such as information about tables and columns, and internal
bookkeeping information. PostgreSQL's system catalogs are regular tables.
You can drop and recreate the tables, add columns, insert and update values, and severely
mess up your system that way.
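For example, the names of all ordinary user tables can be read straight from the pg_class catalog with an ordinary (read-only) query:
SELECT relname
FROM pg_class
WHERE relkind = 'r';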
Access to the data dictionary is allowed through numerous views, which are
divided into three categories: USER, ALL, and DBA.
The system catalog contains information about all three levels of database
schemas: external (view definitions), conceptual (base tables), and internal
(storage and index descriptions).
a. Predefined types
Security and authorization information is also kept in the catalog; this describes
each user’s privileges to access specific database relations and views, and the
creator or owner of each relation.
Most relational systems store their catalog files as DBMS relations. However,
because the catalog is accessed very frequently by the DBMS modules, it is
important to implement catalog access as efficiently as possible.
It may be more efficient to use a specialized set of data structures and access
routines to implement the catalog, thus trading generality for efficiency.
System initialization problem: The catalog tables must be created before the
system can function!
The data dictionary is the repository for database metadata, which is a fancy term for data
describing the database. When you create a table, your description of that table is
considered metadata, and Oracle stores that metadata in its data dictionary. Similarly,
Oracle stores the definitions for other objects you create, such as views, PL/SQL
packages, triggers, synonyms, indexes, and so forth. The database software uses this
metadata to interpret and execute SQL statements, and to properly manage stored data.
You can use the metadata as your window into the database. Whether you're a DBA or a
developer, you need a way to learn about the objects and data within your database.
Codd's fourth rule for relational database systems states that database metadata must be
stored in relational tables just like any other type of data. Oracle exposes database
metadata through a large collection of data dictionary views. Does this violate Codd's
rule? By no means! Oracle's data dictionary views are all based on tables, but the views
provide a much more user-friendly presentation of the metadata.
For example, to find out the names of all of the relational tables that you own, you can
issue the following query:
SELECT table_name
FROM user_tables;
Oracle divides data dictionary views into three families, as indicated by the following
prefixes:
USER_
USER views return information about the objects that you own. For example, a query to
USER_TABLES returns a list of the relational tables that you own.
ALL_
ALL views return information about all objects to which you have access,
regardless of who owns them. For example, a query to ALL_TABLES returns a
list not only of all of the relational tables that you own, but also of all relational
tables to which their owners have specifically granted you access (using the
GRANT command).
DBA_
DBA views are generally accessible only to database administrators, and return
information about all objects in the database, regardless of ownership or access
privileges. For example, a query to DBA_TABLES will return a list of all
relational tables in the database, whether or not you own them or have been
granted access to them. Occasionally, database administrators will grant
developers access to DBA views. Usually, unless you yourself are a DBA, you
won't have access to the DBA views.
Many views have analogs in all three groups. For example, you have USER_TABLES,
ALL_TABLES, and DBA_TABLES. A table is a schema object, and thus owned by a
user, hence the need for USER_TABLES. Table owners can grant specific users access to
their tables, hence the need for ALL_TABLES. Database administrators need to be aware
of all tables in the database, hence the need for DBA_TABLES. In some cases, it doesn't
make sense for a view to have an analog in all groups.
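As a simple illustration of the three families, the following queries return progressively wider sets of tables (the DBA_ query, of course, requires DBA privileges):
SELECT table_name FROM user_tables;
SELECT owner, table_name FROM all_tables;
SELECT owner, table_name FROM dba_tables;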
Oracle's data dictionary views are mapped onto underlying base tables, but the views
form the primary interface to Oracle's metadata. Unless you have specific reasons to go
around the views directly to the underlying base tables, you should use the views. The
views return data in a much more understandable format than you'll get from querying
the underlying tables. In addition, the views make up the interface that Oracle documents
and supports. Using an undocumented interface, i.e. the base tables, is a risky practice.
The primary source of information on Oracle's many data dictionary views is the Oracle9i
Database Reference manual.
You can access that manual, and many others, from the Oracle Technology Network
(OTN). You have to register with OTN in order to view Oracle's documentation online,
but registration is free. If you prefer a hardcopy reference, Oracle In A Nutshell,
published by O'Reilly & Associates, is another source of Oracle data dictionary
information.
a. Relational Algebra
The basic set of operations for the relational model is known as the relational algebra.
These operations enable a user to specify basic retrieval requests.
The result of a retrieval is a new relation, which may have been formed from one or more
relations. The algebra operations thus produce new relations, which can be further
manipulated using operations of the same algebra.
i. SELECT Operation
SELECT operation is used to select a subset of the tuples from a relation that satisfy a
selection condition. It is a filter that keeps only those tuples that satisfy a qualifying
condition – those satisfying the condition are selected while others are discarded.
Example: To select the EMPLOYEE tuples whose department number is four or those
whose salary is greater than $ 30,000 the following notation is used:
σ DNO=4 (EMPLOYEE)        σ SALARY>30000 (EMPLOYEE)
In general, the select operation is denoted by σ <selection condition> (R), where the
symbol σ (sigma) is used to denote the select operator, and the selection condition is a
Boolean expression specified on the attributes of relation R.
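For comparison, assuming the COMPANY schema used in this unit, the two selections correspond to the following SQL statements:
SELECT * FROM employee WHERE dno = 4;
SELECT * FROM employee WHERE salary > 30000;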
ii. PROJECT Operation
This operation selects certain columns from the table and discards the other columns. The
PROJECT operation creates a vertical partitioning: one part with the needed columns
(attributes) containing the results of the operation, and the other containing the discarded
columns.
Example: To list each employee's first and last name and salary, the following is used:
π LNAME, FNAME, SALARY (EMPLOYEE)
The general form of the project operation is π <attribute list> (R), where π (pi) is the
symbol used to represent the project operation and <attribute list> is the desired list of
attributes from the attributes of relation R.
The project operation removes any duplicate tuples, so the result of the project operation
is a set of tuples and hence a valid relation.
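In SQL, the corresponding projection (with DISTINCT added to match the duplicate-elimination semantics of the algebra) could be written as:
SELECT DISTINCT fname, lname, salary
FROM employee;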
The number of tuples in the result of a PROJECT operation π <list> (R) is always less than
or equal to the number of tuples in R.
If the list of attributes includes a key of R, then the number of tuples in the result is equal to
the number of tuples in R.
π <list1> (π <list2> (R)) = π <list1> (R) as long as <list2> contains the attributes in <list1>.
We may want to apply several relational algebra operations one after the other.
Either we can write the operations as a single relational algebra expression by nesting
the operations, or we can apply one operation at a time and create
intermediate result relations. In the latter case, we must give names to the relations that
hold the intermediate results.
Example: To retrieve the first name, last name, and salary of all employees
who work in department number 5, we must apply a select and a project operation.
We can write a single relational algebra expression as follows:
π FNAME, LNAME, SALARY (σ DNO=5 (EMPLOYEE))
Or we can apply one operation at a time, naming the intermediate result relations:
DEP5_EMPS ← σ DNO=5 (EMPLOYEE)
RESULT ← π FNAME, LNAME, SALARY (DEP5_EMPS)
The general RENAME operation can be expressed by any of the following forms:
ρ S(B1, B2, ..., Bn) (R) is a renamed relation S based on R with column names B1, B2, ..., Bn.
ρ S (R) is a renamed relation S based on R (which does not specify column names).
ρ (B1, B2, ..., Bn) (R) is a renamed relation with column names B1, B2, ..., Bn which does
not specify a new relation name.
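In SQL, renaming is expressed with aliases; for example, the EMPLOYEE relation and its columns can be renamed in a query as follows:
SELECT e.fname AS first_name, e.lname AS last_name
FROM employee e;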
a. UNION Operation
The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that are
either in R or in S or in both R and S.
Duplicate tuples are eliminated.
The union operation produces the tuples that are in either RESULT1 or RESULT2 or
both. The two operands must be “type compatible”.
b. Type Compatibility
The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) must have the same
number of attributes, and the domains of corresponding attributes must be compatible;
that is, dom(Ai)=dom(Bi) for i=1, 2, ..., n.
The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 − R2 has the same attribute names as
the first operand relation R1.
UNION Example
STUDENT ∪ INSTRUCTOR
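Assuming that STUDENT and INSTRUCTOR both have compatible (Fname, Lname) attributes, the same union can be written in SQL, which likewise eliminates duplicates:
SELECT fname, lname FROM student
UNION
SELECT fname, lname FROM instructor;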
c. Intersection Operation
The result of this operation, denoted by R ∩ S, is a relation that includes all tuples that are
in both R and S. The two operands must be "type compatible".
Example: STUDENT ∩ INSTRUCTOR gives the people who are both students and instructors.
The result of the SET DIFFERENCE operation, denoted by R − S, is a relation that includes
all tuples that are in R but not in S. The two operands must be "type compatible".
Example: The figure shows the names of students who are not instructors, and
the names of instructors who are not students.
STUDENT-INSTRUCTOR
INSTRUCTOR-STUDENT
Both union and intersection can be treated as n-ary operations applicable to any number
of relations as both are associative operations;
The CARTESIAN PRODUCT (cross product) operation is used to combine tuples from two
relations in a combinatorial fashion. In general, the result of R(A1, A2, ..., An) x S(B1, B2,
..., Bm) is a relation Q with degree n+m attributes Q(A1, A2, ..., An, B1, B2, ..., Bm), in that
order. The resulting relation Q has one tuple for each combination of tuples, one from R
and one from S.
Hence, if R has nR tuples (denoted as |R| = nR) and S has nS tuples, then |R x S| will have
nR * nS tuples.
Example:
FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
EMPNAMES ← π FNAME, LNAME, SSN (FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES x DEPENDENT
a. JOIN Operation
The sequence of CARTESIAN PRODUCT followed by SELECT is used quite commonly to
identify and select related tuples from two relations; hence a special operation, called JOIN,
was created.
This operation is very important for any relational database with more than a single
relation, because it allows us to process relationships among relations.
The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is:
R ⋈ <join condition> S
where R and S can be any relations that result from general relational algebra expressions.
Example: Suppose that we want to retrieve the name of the manager of each department.
To get the manager’s name, we need to combine each DEPARTMENT tuple with the
EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department
tuple. We do this by using the join operation.
DEPT_MGR ← DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
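The same retrieval can be written in SQL (again assuming the COMPANY schema) as a join on the matching columns:
SELECT d.dname, e.fname, e.lname
FROM department d, employee e
WHERE d.mgrssn = e.ssn;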
b. EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons only.
Such a join, where the only comparison operator used is =, is called an EQUIJOIN. In the
result of an EQUIJOIN we always have one or more pairs of attributes that have identical
values in every tuple.
c. NATURAL JOIN Operation
The standard definition of NATURAL JOIN requires that the two join attributes, or each pair
of corresponding join attributes, have the same name in both relations. If this is not the
case, a renaming operation is applied first.
The set of operations including SELECT σ, PROJECT π, UNION ∪, SET DIFFERENCE −,
and CARTESIAN PRODUCT × is called a complete set because any other relational algebra
expression can be expressed by a combination of these five operations.
For example:
R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
R ⋈ <join condition> S = σ <join condition> (R × S)
d. DIVISION Operation
The result of DIVISION is a relation T(Y) that includes a tuple t if tuples tR appear in R
with tR [Y] =t, and with tR [X] =ts for every tuple ts in S.
For a tuple t to appear in the result T of the DIVISION, the values in t must appear in R
in combination with every tuple in S.
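SQL has no direct DIVISION operator; the operation is usually expressed with a double NOT EXISTS. As an illustrative sketch using the COMPANY schema, the SSNs of employees who work on every project controlled by department 5 can be retrieved as:
SELECT DISTINCT w.essn
FROM works_on w
WHERE NOT EXISTS
  (SELECT p.pnumber
   FROM project p
   WHERE p.dnum = 5
     AND NOT EXISTS
       (SELECT 1
        FROM works_on w2
        WHERE w2.essn = w.essn
          AND w2.pno = p.pnumber));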
A type of request that cannot be expressed in the basic relational algebra is to specify
mathematical aggregate functions on collections of values from the database.
ℱ MIN Salary (EMPLOYEE) retrieves the minimum Salary value from the EMPLOYEE relation.
ℱ SUM Salary (EMPLOYEE) retrieves the sum of the Salary values from the EMPLOYEE relation.
DNO ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE) groups employees by DNO
(department number) and computes the count of employees and the average salary per
department.
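The grouped aggregate above corresponds to a GROUP BY query in SQL:
SELECT dno, COUNT(ssn) AS no_of_employees, AVG(salary) AS average_sal
FROM employee
GROUP BY dno;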
c. Recursive Closure Operations
Another type of operation that, in general, cannot be specified in the basic original
relational algebra is recursive closure. This operation is applied to a recursive
relationship.
In NATURAL JOIN tuples without a matching (or related) tuple are eliminated from the
join result. Tuples with null in the join attributes are also eliminated.
A set of operations, called outer joins, can be used when we want to keep all the tuples in
R, or all those in S, or all those in both relations in the result of the join, regardless of
whether or not they have matching tuples in the other relation.
The LEFT OUTER JOIN operation keeps every tuple in the first, or left, relation R in
R ⟕ S; if no matching tuple is found in S, then the attributes of S in the join result are
filled or "padded" with null values.
A similar operation, RIGHT OUTER JOIN (R ⟖ S), keeps every tuple in the second, or
right, relation S in the result.
A third operation, FULL OUTER JOIN (R ⟗ S), keeps all tuples in both the left and the right
relations when no matching tuples are found, padding them with null values as needed.
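In SQL-92 syntax, for example, a left outer join that keeps every employee even when no matching department exists could be written as:
SELECT e.fname, e.lname, d.dname
FROM employee e LEFT OUTER JOIN department d
  ON e.dno = d.dnumber;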
The outer union operation was developed to take the union of tuples from two
relations if the relations are not union compatible.
This operation will take the union of tuples in two relations R(X, Y) and S(X, Z)
that are partially compatible, meaning that only some of their attributes, say X,
are union compatible.
The attributes that are union compatible are represented only once in the result,
and those attributes that are not union compatible from either relation are also
kept in the result relation T(X, Y, Z).
The tuple relational calculus is based on specifying a number of tuple variables. Each
tuple variable usually ranges over a particular database relation, meaning that the
variable may take as its value any individual tuple from that relation.
Example: To find the first and last names of all employees whose salary is above
$ 50,000, we can write the following tuple calculus expression:
{ t.FNAME, t.LNAME | EMPLOYEE(t) AND t.SALARY>50000}
The condition EMPLOYEE(t) specifies that the range relation of tuple variable t is
EMPLOYEE. The first and last name (the projection on FNAME and LNAME) are retrieved
for each EMPLOYEE tuple t that satisfies the condition t.SALARY > 50000.
Basic Algorithms
a. External Sorting
Sorting is one of the primary algorithms used in Query processing (eg., ORDER
BY-clause requires a sorting).
External Sorting is used for large files of records stored on disk that do not fit
entirely in main memory.
The typical external sorting algorithm uses a sort-merge strategy. The algorithm
consists of two phases:
1. Sorting Phase
2. Merging Phase
A number of search algorithms are possible for selecting records from a file.
The JOIN operation is one of the most time consuming operations in query
processing.
Four of the most common techniques for performing a join are as follows:
1. Nested-loop join (brute force)
2. Single loop join (Using an access structure to retrieve the matching
records).
3. Sort-merge join.
4. Hash join.
Two special symbols called quantifiers can appear in formulas; these are the
universal quantifier (∀) and the existential quantifier (∃).
If F is a formula, then so is (∃t)(F), where t is a tuple variable. The formula (∃t)(F) is true if
the formula F evaluates to true for some (at least one) tuple assigned to free occurrences of
t in F; otherwise (∃t)(F) is false.
If F is a formula, then so is (∀t)(F), where t is a tuple variable. The formula (∀t)(F) is true if
the formula F evaluates to true for every tuple assigned to free occurrences of t in F;
otherwise (∀t)(F) is false.
It is called the universal or “for all” quantifier because every tuple in “ the universe of”
tuples must make F true to make the quantified formula true.
Retrieve the name and address of all employees who work for the ‘Research’ department.
Query :
{ t.FNAME, t.LNAME, t.ADDRESS | EMPLOYEE(t) and (∃d)(DEPARTMENT(d)
and d.DNAME=‘Research’ and d.DNUMBER=t.DNO) }
The only free tuple variables in a relational calculus expression should be those that
appear to the left of the bar (| ). In above query, t is the only free variable; it is then bound
successively to each tuple. If a tuple satisfies the conditions specified in the query, the
attributes FNAME, LNAME, and
ADDRESS are retrieved for each such tuple.
The conditions EMPLOYEE (t) and DEPARTMENT(d) specify the range relations for t
and d. The condition d.DNAME =‘Research’ is a selection condition and corresponds to
a SELECT operation in the relational algebra, whereas the condition d.DNUMBER =
t.DNO is a JOIN condition.
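The equivalent SQL query makes the range relations, the selection condition, and the join condition explicit in the FROM and WHERE clauses:
SELECT e.fname, e.lname, e.address
FROM employee e, department d
WHERE d.dname = 'Research'
  AND d.dnumber = e.dno;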
Exclude from the universal quantification all tuples that we are not interested in
by making the condition true for all such tuples. The first tuples to exclude are those that
are not in the relation R
of interest.
In query above, using the expression not(PROJECT(x)) inside the universally quantified
formula evaluates to true all tuples x that are not in the PROJECT relation. Then we
exclude the tuples we are not interested in from R itself. The expression not(x.DNUM=5)
evaluates to true all tuples x that are in the project relation but are not controlled by
department 5.
Finally, we specify a condition that must hold on all the remaining tuples in R.
(∃w)(WORKS_ON(w) and w.ESSN=e.SSN and x.PNUMBER=w.PNO)
The language SQL is based on tuple calculus. It uses the basic SELECT <list of
attributes>FROM <list of relations>WHERE <conditions> block structure to express the
queries in tuple calculus where the SELECT clause mentions the attributes being
projected, the FROM clause mentions the relations needed in the query, and the WHERE
clause mentions the selection as well as the join conditions.
SQL syntax is expanded further to accommodate other operations. Another language
which is based on tuple calculus is QUEL which actually uses the range variables as in
tuple calculus.
The language called QBE (Query-By-Example) that is related to domain calculus was
developed almost concurrently to SQL at IBM Research, Yorktown Heights, New
York. Domain calculus was thought of as a way to explain what QBE does.
Domain calculus differs from tuple calculus in the type of variables used in formulas:
rather than having variables range over tuples, the variables range over single values
from domains of attributes. To form a relation of degree n for a query result, we must
have n of these domain variables—one for each attribute.
An expression of the domain calculus is of the form {x1, x2, ..., xn |COND(x1, x2, ..., xn,
xn+1, xn+2, .., xn+m)} where x1, x2, .., xn, xn+1, xn+2, .., xn+m are domain variables
that range over domains and COND is a condition or formula of the domain relational
calculus.
Retrieve the birthdate and address of the employee whose name is ‘John B.Smith’.
Query :
{ u, v | (∃q)(∃r)(∃s)(∃t)(∃w)(∃x)(∃y)(∃z)
(EMPLOYEE(qrstuvwxyz) and q=’John’ and r=’B’ and s=’Smith’) }
Ten variables for the employee relation are needed, one to range over the
domain of each attribute in order. Of the ten variables q, r, s, .., z, only u and v are free.
Specify the requested attributes, BDATE and ADDRESS, by the free domain
variables u for BDATE and v for ADDRESS.
Specify the condition for selecting a tuple following the bar (|)—namely, that the
sequence of values assigned to the variables qrstuvwxyz be a tuple of the employee
relation and that the values for q(FNAME), r(MINIT), and s(LNAME) be ‘John’, ‘B’,
and ‘Smith’, respectively.
3.2.6. QBE: A Query Language Based on Domain Calculus
This language is based on the idea of giving an example of a query using example
elements.
An example element stands for a domain variable and is specified as an example value
preceded by the underscore character.
P. (called Pdot) operator (for “print”) is placed in those columns which are requested for
the result of the query.
A user may initially start giving actual values as examples, but later can get used to
providing a minimum number of variables as example elements.
QBE was fully developed further with facilities for grouping, aggregation, updating etc.
and is shown to be equivalent to SQL.
The language is available under QMF (Query Management Facility) of DB2 of IBM and
has been used in various ways by other products like ACCESS of Microsoft, PARADOX.
QBE Examples
QBE initially presents a relational schema as a “blank schema” in which the user fills in
the query as an example:
The following domain calculus query can be successively minimized by the user as
shown:
Query :
{ u, v | (∃q)(∃r)(∃s)(∃t)(∃w)(∃x)(∃y)(∃z)
(EMPLOYEE(qrstuvwxyz) and q=’John’ and r=’B’ and s=’Smith’) }
A technique called the “ condition box” is used in QBE to state more involved Boolean
expressions as conditions.
The query in D.4(a) gives employees who work on either project 1 or 2, whereas the query
in D.4(b) gives those who work on both projects.
Illustrating join in QBE: the join is simply accomplished by using the same example
element in the columns being joined. Note that the result is set up as an independent
table.
Optimisation is the process of choosing the most efficient way to execute a SQL
statement. The cost-based optimiser uses statistics to calculate the selectivity of predicates
and to estimate the cost of each possible execution plan.
Partitioned schema objects may contain multiple sets of statistics. They can have
statistics which refer to the entire schema object as a whole ( global statistics ), they can
have statistics which refer to an individual partition, and they can have statistics which
refer to an individual sub-partition of a composite partitioned object.
Unless the query predicate narrows the query to a single partition, the optimiser
uses the global statistics. Because most queries are not likely to be this restrictive,
it is most important to have accurate global statistics. Therefore, actually
gathering global statistics with the DBMS_STATS package is highly
recommended.
The PL/SQL package DBMS_STATS lets you generate and manage statistics for
cost-based optimization. For partitioned tables and indexes, DBMS_STATS can
gather separate statistics for each partition as well as global statistics for the entire
table or index.
exec dbms_stats.gather_schema_stats( -
  ownname          => 'ABC', -
  estimate_percent => 0.5, -
  method_opt       => 'FOR ALL COLUMNS SIZE 1', -
  degree           => 8, -
  granularity      => 'ALL', -
  options          => 'GATHER STALE', -
  cascade          => TRUE -
);
a. Unused Index
Assume you have a table with some thousand or more records. Every record has a
type field to indicate type of this entry. The distribution of the type is:
COUNT(*) TYPE
---------- ----------
94 0
3011 1
If you select all records of type 0, the optimiser should use the index on the type
column for optimal performance. However, the optimiser decides to run a full
table scan instead:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=25 Card=1400
Bytes=64400)
1 0 SORT (ORDER BY) (Cost=25 Card=1400 Bytes=64400)
2 1 TABLE ACCESS (FULL) OF 'MAIL_SERVER' (Cost=3
Card=1400 Bytes=6...
Even if you re-calculate the global statistics after creating the index or after data
load the optimiser does not use this index.
What’s wrong with the optimiser?
b. Use Histograms
The cost-based optimiser uses data value histograms to get accurate estimates of
the distribution of column data. Histograms provide improved selectivity
estimates in the presence of data skew, resulting in optimal execution plans with
non-uniform data distributions.
Histograms can affect performance and should be used only when they
substantially improve query plans. They are useful only when they reflect the
current data distribution of a given column. If the data distribution of a column
changes frequently, you must re-compute its histogram frequently.
One approach to gather histogram statistics on specified tables or table columns is
using the GATHER_TABLE_STATS procedure in the same package:
exec dbms_stats.gather_table_stats( -
ownname => 'ABC', -
tabname =>'MAIL_SERVER', -
method_opt => 'FOR COLUMNS SIZE 10 SERVER_TYPE', -
degree => 8 -
);
The same query for all records of type 0 will result in the following execution
plan:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=94
Bytes=4606)
1 0 SORT (ORDER BY) (Cost=6 Card=94 Bytes=4606)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'MAIL_SERVER'
(Cost=3 Card=94...
3 2 INDEX (RANGE SCAN) OF 'IDX_MAISER_SERVER_TYPE'
(Cost=1 Card=94)
But if you reverse the query by selecting all records of type 1, the optimiser
performs a full table scan, which is the optimal solution in this case:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=53 Card=3011
Bytes=147539)
1 0 SORT (ORDER BY) (Cost=53 Card=3011 Bytes=147539)
2 1 TABLE ACCESS (FULL) OF 'MAIL_SERVER' (Cost=3
Card=3011 Bytes=...
The “price” of this approach (using histograms) is that you are losing global
statistics. This is true for Oracle 8i and should not be a limitation on Oracle 9i.
With Oracle 8i you have to decide whether you need global statistics or
histograms on the same table or index. Alternative solutions might be:
o Switch the optimiser: The hint /*+ RULE */ would switch from
cost-based to rule-based optimiser. And the rule-based optimiser
assumes that using an index is the best solution.
o Use an index hint: The hint /*+ INDEX(<table name> <index
name>) */ would lead the cost-based optimiser to make use of the
given index.
o Migrate to Oracle 9i: Maybe you have the chance or one more
argument to migrate!
The package DBMS_STATS can be used to gather global statistics. Please note
that Oracle 8i currently does not gather global histogram statistics. It is most
important to have accurate global statistics for partitioned schema objects.
Histograms can affect performance and should be used only when they
substantially improve query plans. But Oracle 8i does not support global statistics
and histograms on the same objects. The database designer has to decide how to
go around this limitation. Possible alternative solutions are optimiser hints.
We present a technique for semantic query optimization (SQO) for object databases. We
use the ODMG-93 standard ODL and OQL languages. The ODL object schema and the
OQL object query are translated into a DATALOG representation. Semantic knowledge
about the object model and the particular application is expressed as integrity constraints.
This is an extension of the ODMG-93 standard. SQO is performed on the DATALOG
representation, and an equivalent logic query, and subsequently an equivalent OQL query,
is produced.
The principle of semantic query optimization (King, 1981) is to use semantic rules, such as
"all Tunisian seaports have railroad access", to reformulate a query into a less expensive
but equivalent query, so as to reduce the query evaluation cost. For example, suppose we
have a query to find all Tunisian seaports with railroad access and 2,000,000 ft3 of
storage space. From the rule given above, we can reformulate the query so that there is no
need to check the railroad access of seaports, which may save some execution time.
Two queries are semantically equivalent if they return the same answer for
any database state satisfying a given set of integrity constraints.
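Although the cited work operates on OQL and DATALOG, the idea can be sketched in SQL terms with a hypothetical SEAPORT table. Given the integrity constraint that all Tunisian seaports have railroad access, the first query below can be reformulated as the second, cheaper one:
SELECT name FROM seaport
WHERE country = 'Tunisia'
  AND has_railroad = 'Y'
  AND storage_space >= 2000000;

SELECT name FROM seaport
WHERE country = 'Tunisia'
  AND storage_space >= 2000000;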
Single-User System: At most one user at a time can use the system.
Transaction: a logical unit of database processing that includes one or more access
operations (read - retrieval; write - insert or update; delete).
An application program may contain several transactions separated by the Begin and
End transaction boundaries.
Basic operations are read and write
read_item(X): Reads a database item named X into a program variable. To simplify our
notation, we assume that the program variable is also named X.
write_item(X): Writes the value of program variable X into the database item named X.
Basic unit of data transfer from the disk to the computer main memory is one block. In
general, a data item (what is read or written) will be the field of some record in the
database, although it may be a larger unit such as a record or even a whole block.
The Lost Update Problem: This occurs when two transactions that access the same
database items have their operations interleaved in a way that makes the value of some
database item incorrect.
The Temporary Update (Dirty Read) Problem: This occurs when one transaction updates a
database item and then the transaction fails for some reason. The updated item is accessed
by another transaction before it is changed back to its original value.
6. Physical problems and catastrophes: This refers to an endless list of problems that
includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes
by mistake, and mounting of a wrong tape by the operator.
A transaction is an atomic unit of work that is either completed in its entirety or not
done at all. For recovery purposes, the system needs to keep track of when the transaction
starts, terminates, and commits or aborts.
Transaction states:
Active state
Partially committed state
Committed state
Failed state
Terminated State
read or write: These specify read or write operations on the database items that are
executed as part of a transaction.
end_transaction: This specifies that read and write transaction operations have ended
and marks the end limit of transaction execution. At this point it may be necessary to
check whether the changes introduced by the transaction can be permanently applied to
the database or whether the transaction has to be aborted because it violates concurrency
control or for some other reason.
commit_transaction: This signals a successful end of the transaction so that any changes
( updates) executed by the transaction can be safely committed to the database and will
not be undone.
rollback (or abort): This signals that the transaction has ended unsuccessfully, so that
any changes or effects that the transaction may have applied to the database must be
undone.
undo: Similar to rollback except that it applies to a single operation rather than to a
whole transaction.
redo: This specifies that certain transaction operations must be redone to ensure that all
the operations of a committed transaction have been applied successfully to the database.
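In SQL, for example, a simple transfer transaction brackets its writes with an explicit COMMIT, or a ROLLBACK if something goes wrong (the ACCOUNT table here is purely illustrative):
UPDATE account SET balance = balance - 100 WHERE acct_no = 1;
UPDATE account SET balance = balance + 100 WHERE acct_no = 2;
COMMIT;
-- or, on failure: ROLLBACK;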
Log or Journal :
The log keeps track of all transaction operations that affect the values of database items.
This information may be needed to permit recovery from transaction failures. The log is
kept on disk, so it is not affected by any type of failure except for disk or catastrophic
failure. In addition, the log is periodically backed up to archival storage (tape) to guard
against such catastrophic failures.
A transaction-id is generated automatically by the system and is used to identify each
transaction.
Recovery using log records
If the system crashes, we can recover to a consistent database state by examining the log
and using one of the following techniques.
1. Because the log contains a record of every write operation that changes the value of
some database item, it is possible to undo the effect of these write operations of a
transaction T by tracing backward through the log and resetting all items changed by a
write operation of T to their old_values.
2. We can also redo the effect of the write operations of a transaction T by tracing
forward through the log and setting all items changed by a write operation of T to their
new_values.
Definition: A transaction T reaches its commit point when all its operations that access
the database have been executed successfully and the effect of all the transaction
operations on the database has been recorded in the log. Beyond the commit point, the
transaction is said to be committed, and its effect is assumed to be permanently recorded
in the database. The transaction then writes an entry [ commit,T] into the log.
Redoing transactions: Transactions that have written their commit entry in the log must
also have recorded all their write operations in the log; otherwise they would not be
committed, so their effect on the database can be redone from the log entries.
( Notice that the log file must be kept on disk. At the time of a system crash, only the log
entries that have been written back to disk are considered in the recovery process because
the contents of main memory may be lost.)
Force writing a log: before a transaction reaches its commit point, any portion of the log
that has not been written to the disk yet must now be written to the disk. This process is
called force- writing the log file before committing a transaction
a. ACID properties:
Atomicity: A transaction is an atomic unit of processing; it is either performed in its
entirety or not performed at all.
Consistency preservation: A correct execution of the transaction must take the database
from one consistent state to another.
Isolation: A transaction should not make its updates visible to other transactions until it
is committed; this property, when enforced strictly, solves the temporary update problem
and makes cascading rollbacks of transactions unnecessary.
Durability or permanency: Once a transaction changes the database and the changes
are committed, these changes must never be lost because of subsequent failure.
Schedule (or history) S of n transactions: an ordering of the operations of the transactions
subject to the constraint that, for each transaction Ti that participates in S, the operations of
Ti in S must appear in the same order in which they occur in Ti. Note, however, that
operations from other transactions Tj can be interleaved with the operations of Ti in S.
Cascadeless schedule: One where every transaction reads only the items that are written
by committed transactions.
Strict schedule: A schedule in which a transaction can neither read nor write an item X
until the last transaction that wrote X has committed.
Serializability of Schedules
Serial schedule: A schedule S is serial if, for every transaction T participating in the
schedule, all the operations of T are executed consecutively in the schedule. Otherwise,
the schedule is called nonserial schedule.
Result equivalent: Two schedules are called result equivalent if they produce the same
final state of the database.
Conflict equivalent: Two schedules are said to be conflict equivalent if the order of any
two conflicting operations is the same in both schedules.
Being serializable is not the same as being serial. Being serializable implies that the
schedule is a correct schedule. It will leave the database in a consistent state. The
interleaving is appropriate and will result in a state as if the transactions were serially
executed, yet will achieve efficiency due to concurrent execution.
Practical approach
Come up with methods (protocols) to ensure serializability. It is not possible to determine
when a schedule begins and when it ends; hence, we reduce the problem of checking the
whole schedule to checking only a committed projection of the schedule.
Two schedules are said to be view equivalent if the following three conditions hold:
1. The same set of transactions participates in S and S’, and S and S’ include the same
operations of those transactions.
2. For any operation Ri(X) of Ti in S, if the value of X read by the operation has been
written by an operation Wj(X) of Tj (or if it is the original value of X before the schedule
started), the same condition must hold for the value of X read by operation Ri(X) of Ti in
S’.
3. If the operation Wk(Y) of Tk is the last operation to write item Y in S, then Wk(Y) of
Tk must also be the last operation to write item Y in S’.
As long as each read operation of a transaction reads the result of the same write
operation in both schedules, the write operations of each transaction must produce the
same results.
“The view”: the read operations are said to see the same view in both schedules.
Any conflict serializable schedule is also view serializable, but not vice versa.
Consider the following schedule of three transactions T1: r1(X), w1(X); T2: w2(X); and
T3: w3(X):
Sa is view serializable, since it is view equivalent to the serial schedule T1, T2, T3.
However, Sa is not conflict serializable, since it is not conflict equivalent to any serial
schedule.
Constructing the precedence graphs for schedules A and D to test for conflict
serializability:
. (a) Precedence graph for serial schedule A.
(b) Precedence graph for serial schedule B.
(c) Precedence graph for schedule C (not serializable).
(d) Precedence graph for schedule D (serializable, equivalent to schedule A).
Under special semantic constraints, schedules that are otherwise not conflict serializable
may work correctly, for example by using the commutative operations of addition and
subtraction.
Example: bank credit / debit transactions on a given item are separable and
commutative.
With SQL, there is no explicit Begin Transaction statement. Transaction initiation is done
implicitly when particular SQL statements are encountered.
Every transaction must have an explicit end statement, which is either a COMMIT or
ROLLBACK.
Access mode: READ ONLY or READ WRITE. The default is READ WRITE unless the
isolation level READ UNCOMMITTED is specified, in which case READ ONLY is
assumed.
Diagnostics size n: specifies an integer value n, indicating the number of conditions that
can be held simultaneously in the diagnostics area.
Characteristics are specified by a SET TRANSACTION statement.
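A typical SET TRANSACTION statement combining these characteristics (SQL-92 syntax) might look like the following:
SET TRANSACTION READ WRITE,
    ISOLATION LEVEL SERIALIZABLE,
    DIAGNOSTICS SIZE 5;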
Dirty Read: Reading a value that was written by a transaction which failed.
Phantoms: New rows being read using the same read with a condition. A transaction T1
may read a set of rows from a table, perhaps based on some condition specified in the
SQL WHERE clause. Now suppose that a transaction T2 inserts a new row that also
satisfies the WHERE clause condition of T1, into the table used by T1. If T1 is repeated,
then T1 will see a row that previously did not exist, called a phantom.
Data Model: A set of concepts to describe the structure of a database, and certain
constraints that the database should obey.
Data Model Operations: Operations for specifying database retrievals and
updates by referring to the concepts of the data model. Operations on the data
model may include basic operations and user-defined operations.
Integrated data means that the database may be thought of as a unification of
several otherwise distinct data files, with any redundancy among those files either
wholly or partly eliminated.
An entity is any distinguishable real world object that is to be represented in the
database; each entity will have attributes or properties.
The metadata is information about schema objects, such as tables, indexes,
views, triggers, and more.
Sorting is one of the primary algorithms used in Query processing.
Access to the data dictionary is allowed through numerous views, which are
divided into three categories: USER, ALL, and DBA.
ACID properties are Atomicity, Consistency preservation, Isolation, and
Durability (permanency).
3.5 Summary
The system catalog contains information about all three levels of database
schemas: external (view definitions), conceptual (base tables), and internal
(storage and index descriptions).
SQL objects (i.e., tables, views, ...) are contained in schemas. Schemas are
contained in catalogs. Each schema has a single owner. Objects can be referenced
with explicit or implicit catalog and schema names.
Oracle's data dictionary views are mapped onto underlying base tables, but the
views form the primary interface to Oracle's metadata. Unless you have specific
reasons to go around the views directly to the underlying base tables, you should
use the views. The views return data in a much more understandable format than
you'll get from querying the underlying tables. In addition, the views make up the
interface that Oracle documents and supports. Using an undocumented interface,
i.e. the base tables, is a risky practice
3.8 Assignment
3.9 Reference Books
Bloesch, A. and Halpin, T. (1996) “ConQuer: a Conceptual Query Language”
Proc.ER’96: 15th International Conference on Conceptual Modeling, Springer LNCS,
no. 1157.
Bloesch, A. and Halpin, T. (1997) “Conceptual Queries Using ConQuer-II” in. David W.
Embley, Robert C. Goldstein (Eds.): Conceptual Modeling - ER '97, 16th International
Conference on Conceptual Modeling, Los Angeles, California, USA, November 3-5,
1997, Proceedings. Lecture Notes in Computer Science 1331 Springer 1997
Elmasri, R. & Navathe, S. B. (2000). Fundamentals of Database Systems. (3rd ed.).
3.11 Keywords
1. Data Model
2. Entity
3. Network Model
4. Data Dictionary
5. Metadata
6. Relational Algebra
7. ACID Properties
8. Semantic Query Optimization
UNIT – IV
Topics:
Concurrency Control Techniques
Locking Techniques for Concurrency Control
Concurrency Control Based on Timestamp Ordering
Validation Concurrency Control Techniques
Granularity of Data Items and Multiple Granularity Locking
Using Locks for Concurrency Control In Indexes
Database Recovery Techniques: Recovery Concepts
Recovery Techniques Based On Deferred Update / Immediate Update / Shadow Paging
The ARIES Recovery Algorithm
Database Security and Authorization
4.0 Introduction
Concurrency control provides isolation among the conflicting transactions that take part in
database processing. It helps to preserve the integrity of every individual record, and it
helps the database retain consistency and ease of use, which in turn promotes reliability.
4.1 Objective
The objective of this unit is to learn and understand concurrency control techniques in
terms of locking, validation, granularity of data items and multiple granularity locking, as
well as database recovery techniques and database security and authorization.
4.2 Contents
Purpose of Concurrency Control
Locking is an operation which secures (a) permission to Read or (b) permission to Write
a data item for a transaction.
Example: Lock (X). Data item X is locked on behalf of the requesting transaction.
Unlocking is an operation which removes these permissions from the data item. Example:
Unlock (X).
Data item X is made available to all other transactions. Lock and Unlock are Atomic
operations.
Two lock modes: (a) shared (read) and (b) exclusive (write).
Shared mode: shared lock (X). More than one transaction can apply a shared lock on X for
reading its value, but no write lock can be applied on X by any other transaction.
Exclusive mode: write lock (X). Only one write lock on X can exist at any time, and no
shared lock can be applied by any other transaction on X.
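In SQL these two modes can also be requested explicitly at the table level; for example, Oracle accepts:
LOCK TABLE employee IN SHARE MODE;
LOCK TABLE employee IN EXCLUSIVE MODE;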
Lock table: the lock manager uses it to store the identity of the transaction locking a data
item, the data item, the lock mode, and a pointer to the next data item locked. One simple
way to implement a lock table is through a linked list.
The database requires that all transactions should be well-formed.
lock_item (X):
B:  if LOCK (X) = 0                      (* item is unlocked *)
        then LOCK (X) ← 1                (* lock the item *)
    else begin
        wait (until LOCK (X) = 0 and the lock manager wakes up the transaction);
        go to B
    end;
Locking (Growing) Phase: A transaction applies locks (read or write) on desired data
items one at a time.
Unlocking (Shrinking) Phase: A transaction unlocks its locked data items one at a time.
Requirement: For a transaction these two phases must be mutually exclusive; that is,
during the locking phase the unlocking phase must not start, and during the unlocking
phase the locking phase must not begin.
Two-phase policy generates two locking algorithms (a) Basic and (b) Conservative.
Conservative: Prevents deadlock by locking all desired data items before transaction
begins execution.
Basic: Transaction locks data items incrementally. This may cause deadlock which is
dealt with.
Strict: A stricter version of the Basic algorithm, where unlocking is performed after a
transaction terminates (commits, or aborts and is rolled back).
This is the most commonly used two-phase locking algorithm.
Deadlock prevention
A transaction locks all data items it refers to before it begins execution. This way of
locking prevents deadlock since a transaction never waits for a data item. The
conservative two-phase locking uses this approach.
Deadlock occurs when each of two transactions is waiting for the other to release the lock
on an item.
In general a deadlock may involve n (n>2) transactions, and can be detected by using a
wait-for graph.
In this approach, deadlocks are allowed to happen. The scheduler maintains a wait-for graph for detecting cycles. If a cycle exists, one transaction involved in the cycle is selected as the victim and rolled back.
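A minimal Python sketch of deadlock detection on such a wait-for graph follows; the graph is represented here as a hypothetical dictionary mapping each waiting transaction to the set of transactions it waits for.

# Detect a cycle in a wait-for graph using depth-first search.
def find_deadlock(wait_for):
    # wait_for: {txn: set of txns it is waiting for}
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wait_for.get(t, ()):
            if u not in visited:
                if dfs(u):
                    return True
            elif u in on_stack:          # back edge -> cycle -> deadlock
                return True
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in list(wait_for) if t not in visited)

# Example: T1 waits for T2 and T2 waits for T1, so a deadlock is reported.
# find_deadlock({"T1": {"T2"}, "T2": {"T1"}}) returns True.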
Deadlock avoidance
There are many variations of the two-phase locking algorithm. Some avoid deadlock by not letting a cycle form: as soon as the algorithm discovers that blocking a transaction is likely to create a cycle, it rolls back that transaction. The Wound-Wait and Wait-Die algorithms use timestamps to avoid deadlocks by rolling back a victim, as sketched below.
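A minimal Python sketch of the two rules, assuming each transaction carries the timestamp it was assigned when it started (a smaller timestamp means an older transaction); the function names are hypothetical.

# Ti requests an item currently held by Tj; ts_i and ts_j are their start timestamps.
def wait_die(ts_i, ts_j):
    # Older transactions may wait; younger ones die (abort) and restart later
    # with the SAME timestamp, so they eventually become the oldest.
    return "Ti waits" if ts_i < ts_j else "abort Ti"

def wound_wait(ts_i, ts_j):
    # Older transactions wound (abort) the younger holder; younger ones wait.
    return "abort Tj" if ts_i < ts_j else "Ti waits"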
Starvation
Starvation occurs when a particular transaction consistently waits or is restarted and never gets a chance to proceed further. In a deadlock resolution scheme it is possible that the same transaction is repeatedly selected as the victim and rolled back. This limitation is inherent in all priority-based scheduling mechanisms. In the Wound-Wait scheme, a younger transaction may repeatedly be wounded (aborted) by a long-running older transaction, which may create starvation.
Deadlock Avoidance Strategies
These include:
1. No waiting: if a transaction cannot obtain a lock it is aborted and restarted immediately, without waiting to see whether a deadlock would actually occur.
2. Cautious waiting: a transaction is allowed to wait for a lock only if the transaction currently holding it is not itself blocked; otherwise the requesting transaction is aborted.
3. Timeouts: waits longer than a system-defined limit are assumed to indicate a deadlock, and the waiting transaction is aborted.
Basic timestamp ordering
1. Whenever a transaction T issues a write_item(X) operation:
a. If read_TS(X) > TS(T) or write_TS(X) > TS(T), then a younger transaction has already read or written the data item, so abort and roll back T and reject the operation.
b. If the condition in part (a) does not hold, then execute the write_item(X) of T and set write_TS(X) to TS(T).
2. Whenever a transaction T issues a read_item(X) operation:
a. If write_TS(X) > TS(T), then a younger transaction has already written to the data item, so abort and roll back T and reject the operation.
b. Otherwise, execute the read_item(X) of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
Strict timestamp ordering
A transaction T that issues a read_item(X) or write_item(X) with TS(T) > write_TS(X) (or, for a write, TS(T) > read_TS(X)) is delayed until the transaction T’ that wrote or read X has terminated (committed or aborted).
A modification of the write rule (Thomas’s write rule) rejects fewer write operations:
1. If read_TS(X) > TS(T), then abort and roll back T and reject the operation.
2. If write_TS(X) > TS(T), then just ignore the write operation and continue execution, because only the most recent write counts in the case of two consecutive writes.
3. If neither of the conditions in 1 and 2 occurs, then execute the write_item(X) of T and set write_TS(X) to TS(T).
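A compact Python sketch of the basic timestamp-ordering checks above; the read and write timestamps are kept in hypothetical dictionaries keyed by item name, and timestamps are assumed to be positive numbers.

# Basic timestamp ordering: reject operations that arrive "too late".
read_TS, write_TS = {}, {}     # item -> largest timestamp that read / wrote it

def write_item(ts_T, item):
    if read_TS.get(item, 0) > ts_T or write_TS.get(item, 0) > ts_T:
        return "abort"                     # a younger transaction already read/wrote X
    write_TS[item] = ts_T
    return "write executed"

def read_item(ts_T, item):
    if write_TS.get(item, 0) > ts_T:
        return "abort"                     # a younger transaction already wrote X
    read_TS[item] = max(read_TS.get(item, 0), ts_T)
    return "read executed"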
Multiversion concurrency control maintains a number of versions of a data item and allocates the right version to a read operation of a transaction. Thus, unlike the other mechanisms, a read operation is never rejected.
Side effect: significantly more storage (RAM and disk) is required to maintain multiple versions. To limit the unbounded growth of versions, garbage collection is run when some criterion is satisfied.
Assume X1, X2, …, Xn are the versions of a data item X created by the write operations of transactions. With each version Xi, a read_TS (read timestamp) and a write_TS (write timestamp) are associated.
write_TS(Xi): the write timestamp of Xi is the timestamp of the transaction that wrote the value of version Xi. A new version of X is created only by a write operation.
read_TS(Xi): the read timestamp of Xi is the largest of the timestamps of the transactions that have successfully read version Xi.
To ensure serializability, the following two rules are used:
1. If transaction T issues write_item(X), and version i of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back T; otherwise create a new version Xj of X and set read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues read_item(X), find the version i of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value of Xi to T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).
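A sketch of rule 2 (the read case) in Python; each version is a hypothetical record with value, read_TS and write_TS fields, and an initial version with write_TS = 0 is assumed to exist so that a suitable version is always found.

# Multiversion read: return the version with the largest write_TS <= TS(T).
def mv_read(ts_T, versions):
    # versions: list of dicts {"value": ..., "read_TS": ..., "write_TS": ...}
    candidates = [v for v in versions if v["write_TS"] <= ts_T]
    chosen = max(candidates, key=lambda v: v["write_TS"])
    chosen["read_TS"] = max(chosen["read_TS"], ts_T)   # the read is never rejected
    return chosen["value"]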
Multiversion two-phase locking using certify locks proceeds in the following steps:
1. X is the committed version of a data item.
2. T creates a second version X’ after obtaining a write lock on X.
3. Other transactions continue to read X.
4. T is ready to commit so it obtains a certify lock on X’.
5. The committed version X becomes X’.
6. T releases its certify lock on X’, which is X now.
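The lock compatibility that this scheme relies on can be summarised in a small table; the Python dictionary below is a sketch for illustration only.

# Multiversion 2PL lock compatibility: read locks are compatible with read and
# write locks (readers use the old committed version), but certify locks are
# compatible with nothing, so certification waits until all readers finish.
COMPATIBLE = {
    ("read", "read"): True,     ("read", "write"): True,    ("read", "certify"): False,
    ("write", "read"): True,    ("write", "write"): False,  ("write", "certify"): False,
    ("certify", "read"): False, ("certify", "write"): False, ("certify", "certify"): False,
}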
Note
In multiversion 2PL, read and write operations from conflicting transactions can be processed concurrently. This improves concurrency, but it may delay transaction commit because certify locks must be obtained on all the items a transaction has written. It avoids cascading aborts but, as in the strict two-phase locking scheme, conflicting transactions may become deadlocked.
In validation (optimistic) concurrency control, serializability is checked only at the time of commit, and transactions are aborted in case of non-serializable schedules. There are three phases:
Read phase: A transaction can read values of committed data items. However, updates
are applied only to local copies (versions) of the data items (in database cache).
Validation phase: Serializability is checked before transactions write their updates to the
database.
This phase for Ti checks that, for each transaction Tj that is either committed or is in its validation phase, one of the following conditions holds:
1. Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items in common with the write_set of Tj.
3. Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj completes its read phase before Ti completes its read phase.
When validating Ti, the first condition is checked first for each transaction Tj, since (1) is the simplest condition to check. If (1) is false then (2) is checked, and if (2) is false then (3) is checked. If none of these conditions holds, the validation fails and Ti is aborted.
Write phase: On a successful validation transactions’ updates are applied to the database;
otherwise, transactions are restarted.
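A minimal Python sketch of the validation test above; the read/write sets and the phase-boundary timestamps are hypothetical attributes of simple transaction records, not fields of any real system.

# Returns True if Ti passes validation against Tj (one of the three conditions).
def validate_pair(Ti, Tj):
    if Tj.write_finish <= Ti.read_start:                  # condition 1
        return True
    if (Tj.write_finish <= Ti.write_start
            and not (Ti.read_set & Tj.write_set)):        # condition 2
        return True
    if (not (Ti.read_set & Tj.write_set)
            and not (Ti.write_set & Tj.write_set)
            and Tj.read_finish <= Ti.read_finish):        # condition 3
        return True
    return False

def validate(Ti, others):
    # Ti may enter its write phase only if it validates against every relevant Tj.
    return all(validate_pair(Ti, Tj) for Tj in others)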
A lockable unit of data defines its granularity. Granularity can be coarse (the entire database) or fine (a single attribute value of a record). Data item granularity significantly affects concurrency control performance: the degree of concurrency is low for coarse granularity and high for fine granularity. Examples of data item granularity:
1. A field of a database record (an attribute of a tuple).
2. A database record (a tuple of a relation).
3. A disk block.
4. An entire file.
5. The entire database.
To manage such a hierarchy, in addition to read (shared) and write (exclusive) locks, three additional locking modes, called intention lock modes, are defined:
1. Intention-shared (IS): one or more shared locks will be requested on some descendant node(s).
2. Intention-exclusive (IX): one or more exclusive locks will be requested on some descendant node(s).
3. Shared-intention-exclusive (SIX): the current node is locked in shared mode, but one or more exclusive locks will be requested on some descendant node(s).
The set of rules that must be followed for producing a serializable schedule is:
1. The lock compatibility rules among the S, X, IS, IX and SIX modes must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by T in S or IS mode only if the parent of N is already locked by T in either IS or IX mode.
4. A node N can be locked by T in X, IX, or SIX mode only if the parent of N is already
locked by T in either IX or SIX mode.
5. T can lock a node only if it has not unlocked any node (to enforce 2PL policy).
6. T can unlock a node, N, only if none of the children of N are currently locked by T.
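For reference, the compatibility among the five lock modes used in multiple granularity locking can be written as a small table; the Python dictionary below is a sketch for illustration only.

# Multiple granularity lock compatibility (True = the two modes can coexist on
# the same node when held by different transactions).
MGL_COMPATIBLE = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}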
Real-time database systems are expected to rely heavily on indexes to speed up data
access and, thereby, help more transactions meet their deadlines. Accordingly, high-
performance index concurrency control (ICC) protocols are required to prevent
contention for the index from becoming a bottleneck. A new real-time ICC protocol
called GUARD-link augments the classical B-link protocol with a feedback-based
admission control mechanism and also supports both point and range queries, as well as
the undos of the index actions of aborted transactions. The performance metrics used in
evaluating the ICC protocols are the percentage of transactions that miss their deadlines
and the fairness with respect to transaction type and size.
The database can be updated immediately, but an update operation must be recorded in the log before it is applied to the database.
In a single-user system, if a failure occurs, the recovery process undoes all operations of the transaction that was interrupted by the failure.
When concurrent execution is permitted, the recovery process depends on the protocols
used for concurrency control. For example, a strict two phase locking protocol does not
allow a transaction to read or write an item unless the transaction that last wrote the item
has committed
Database recovery refers to the process of restoring the database to a correct state in the event of a failure. The need for recovery control arises because there are two types of storage: volatile (main memory) and nonvolatile (secondary storage).
Failure types
Failures can arise from system crashes, resulting in loss of main memory; media failures, resulting in loss of parts of secondary storage; application software errors; natural physical disasters; carelessness or unintentional destruction of data or facilities; and sabotage.
A checkpoint is a point of synchronization between the database and the log file: all buffers are force-written to secondary storage, and a checkpoint record is created containing the identifiers of all active transactions. When a failure occurs, redo all transactions that committed since the checkpoint and undo all transactions that were active at the time of the crash.
If the database has been damaged, the last backup copy of the database must be restored and the updates of committed transactions reapplied using the log file. If the database is only inconsistent, the changes that caused the inconsistency must be undone; some transactions may also need to be redone to ensure that their updates reach secondary storage. This does not require the backup, since the database can be restored using the before- and after-images in the log file.
The main recovery techniques are:
• Deferred Update
• Immediate Update
• Shadow Paging.
Deferred Updates
• Updates are not written to the database until after a transaction has reached its commit
point.
• If transaction fails before commit, it will not have modified database and so no undoing
of changes required.
• May be necessary to redo updates of committed transactions as their effect may not
have reached database.
Immediate Updates
• Updates are applied to database as they occur.
• Need to redo updates of committed transactions following a failure.
• May need to undo effects of transactions that had not committed at time of failure.
• It is essential that log records are written before the corresponding write to the database; this is called the write-ahead log protocol.
• If there is no "transaction commit" record in the log, then that transaction was active at the time of failure and must be undone.
• Undo operations are performed in the reverse of the order in which they were written to the log.
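A minimal Python sketch of undo with a write-ahead log: each log record stores the before-image, and the writes of uncommitted transactions are undone in reverse order. The record layout and names are hypothetical.

# Each log entry: {"txn": id, "item": name, "before": old_value, "after": new_value}
def undo_uncommitted(log, committed, database):
    # Scan the log backwards and restore before-images of uncommitted transactions.
    for rec in reversed(log):
        if rec["txn"] not in committed:
            database[rec["item"]] = rec["before"]

# Example: T2 never committed, so its write of X is rolled back to the value 5.
db = {"X": 9}
log = [{"txn": "T1", "item": "X", "before": 4, "after": 5},
       {"txn": "T2", "item": "X", "before": 5, "after": 9}]
undo_uncommitted(log, committed={"T1"}, database=db)   # db["X"] becomes 5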
Shadow Paging
• Maintain two page tables during the life of a transaction: a current page table and a shadow page table.
• When transaction starts, two pages are the same.
• Shadow page table is never changed thereafter and is used to restore database in event
of failure.
• During transaction, current page table records all updates to database.
• When transaction completes, current page table becomes shadow page table.
This recovery scheme does not require the use of a log in a single-user environment. In a
multiuser environment, a log may be needed for the concurrency control method.
When a transaction begins executing, the current directory, whose entries point to the most recent or current database pages on disk, is copied into a shadow directory. The shadow directory is then saved on disk while the current directory is used by the transaction.
When a write_item operation is performed, a new copy of the modified database page is created; the old copy, still pointed to by the shadow directory, is not overwritten.
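A minimal sketch of the shadow-directory idea in Python; the directories are hypothetical dictionaries mapping page numbers to page copies, and this is illustrative only.

# Shadow paging sketch: the shadow directory keeps pointing at the old pages.
class ShadowPaging:
    def __init__(self, pages):
        self.pages = dict(pages)                 # page_id -> contents on "disk"
        self.current = {p: p for p in pages}     # current directory
        self.shadow = dict(self.current)         # saved when the transaction starts

    def write_page(self, page_id, new_contents, txn_id):
        new_id = f"{page_id}.{txn_id}"           # new copy; the old page is untouched
        self.pages[new_id] = new_contents
        self.current[page_id] = new_id

    def commit(self):
        self.shadow = dict(self.current)         # the current directory becomes the shadow

    def abort(self):
        self.current = dict(self.shadow)         # discard the new copies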
The ARIES recovery algorithm is based on three concepts:
1. Write-ahead logging: a change cannot be applied to the database on disk before the corresponding log record has been written to the log on disk.
2. Repeating history during redo: ARIES will retrace all actions of the database system prior to the crash to reconstruct the database state when the crash occurred.
3. Logging changes during undo: It will prevent ARIES from repeating the completed
undo operations if a failure occurs during recovery, which causes a restart of the recovery
process.
ARIES recovery itself proceeds in three steps.
1. Analysis: this step identifies the dirty (updated) pages in the buffer and the set of transactions that were active at the time of the crash. The appropriate point in the log where the redo pass should start is also determined.
Each log record in ARIES has a log sequence number (LSN) and includes, among other fields:
1. Previous LSN of that transaction: it links the log records of each transaction, acting as a back pointer to the previous record of the same transaction.
2. Transaction ID.
For efficient recovery, the following tables are also stored in the log during checkpointing:
Transaction table: Contains an entry for each active transaction, with information such
as transaction ID, transaction status and the LSN of the most recent log record for the
transaction.
Dirty Page table: Contains an entry for each dirty page in the buffer, which includes the
page ID and the LSN corresponding to the earliest update to that page.
Checkpointing
Checkpointing in ARIES consists of the following:
1. Writes a begin_checkpoint record in the log.
2. Writes an end_checkpoint record in the log. With this record, the contents of the transaction table and the dirty page table are appended to the end of the log.
3. Writes the LSN of the begin_checkpoint record to a special file. This special file is
accessed during recovery to locate the last checkpoint information.
To reduce the cost of checkpointing and allow the system to continue to execute
transactions, ARIES uses “fuzzy checkpointing”.
1. Analysis phase: as the log is scanned from the last checkpoint, the dirty page table and transaction table may be modified. The analysis phase compiles the set of redo and undo operations to be performed and then ends.
2. Redo phase: starts from the point in the log up to which all dirty pages are known to have been flushed and moves forward to the end of the log; any change that applies to a page in the dirty page table is redone.
3. Undo phase: starts from the end of the log and proceeds backward while performing the appropriate undo operations. For each undo it writes a compensating record in the log.
The recovery completes at the end of the undo phase.
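A highly simplified sketch of the three passes over a log in Python; the log-record layout is hypothetical and LSN bookkeeping, checkpoints and compensation log records are omitted.

# log: list of records such as
#   {"txn": id, "page": p, "before": old, "after": new, "type": "update"}
#   {"txn": id, "type": "commit"}
def aries_like_recover(log, database):
    # 1. Analysis: find transactions with no commit record ("losers").
    committed = {r["txn"] for r in log if r["type"] == "commit"}
    losers = {r["txn"] for r in log if r["type"] == "update"} - committed

    # 2. Redo: repeat history - reapply every logged update, committed or not.
    for r in log:
        if r["type"] == "update":
            database[r["page"]] = r["after"]

    # 3. Undo: roll back the loser transactions, scanning the log backwards.
    for r in reversed(log):
        if r["type"] == "update" and r["txn"] in losers:
            database[r["page"]] = r["before"]
            # Full ARIES would also write a compensation log record (CLR) here.
    return database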
4.2.8 Recovery In Multi Database System
A multidatabase system is a special distributed database system where one node may be running a relational database system under UNIX, another may be running an object-oriented system under Windows, and so on. A transaction may run in a distributed fashion at multiple nodes. In this execution scenario the transaction commits only when all these multiple nodes agree to commit individually the part of the transaction they were executing. This commit scheme is referred to as "two-phase commit" (2PC). If any one of these nodes fails or cannot commit its part of the transaction, then the transaction is aborted. Each node recovers the transaction under its own recovery protocol.
In some cases a single transaction (called a multidatabase transaction) may require access to multiple databases.
Phase 1: When all participating databases signal the coordinator that their part of the multidatabase transaction has concluded, the coordinator sends a "prepare for commit" message; each participating database force-writes its log records and replies "OK" (or "not OK" if its part failed) according to the result of the force-write.
Phase 2: If all participating databases reply "OK", the transaction is successful and the coordinator sends a "commit" signal to the participating databases; otherwise the coordinator sends a rollback signal.
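A minimal sketch of the coordinator's side of two-phase commit in Python; the participant interface with prepare(), commit() and rollback() methods is hypothetical, and timeouts and logging are omitted.

# Two-phase commit, coordinator side.
def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare (force-write its log and vote).
    votes = [p.prepare() for p in participants]      # each returns True for "OK"

    # Phase 2: commit only if every vote was "OK"; otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"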
A key assumption has been that the system log is maintained on the disk and is not lost
as a result of the failure.
The recovery manager of a DBMS must also be equipped to handle more catastrophic
failures such as disk crashes.
The main technique used to handle such cases is database backup: the whole database and the log are periodically copied onto a cheap storage medium such as magnetic tape.
The most common form of access control in a relational database is the view (for a
detailed discussion of relational databases, see [RobCor93]). The view is a logical table,
which is created with the SQL CREATE VIEW command. This table contains data from the database obtained through additional SQL operations such as joins and selections. If the database is unclassified, the source for the view is the entire
database. If, on the other hand, the database is subject to multilevel classification, then
the source for the view is that subset of the database that is at or below the classification
level of the user. Users can read or modify data in their view, but the view prohibits users
from accessing data at a classification level above their own. In fact, if the view is
properly designed, a user at a lower classification level will be unaware that data exists at
a higher classification level [Denn87a].
In order to define what data can be included in a view source, all data in the database
must receive an access classification. Denning [Denn87a] lists several potential access
classes that can be applied.
These include:
(1) Type dependent: Classification is determined based on the attribute associated with
the data.
(2) Value dependent: Classification is determined based on the value of the data.
(3) Source level: Classification of the new data is set equivalent to the classification of
the data source.
(4) Source label: The data is arbitrarily given a classification by the source or by the user
who enters the data.
Classification of data and development of legal views become much more complex when
the security goal includes the reduction of the threat of inference attacks. Inference is
typically made from data at a lower classification level that has been derived from higher
level data. The key to this relationship is the derivation rule, which is defined as the
operation that creates the derived data (for example, a mathematical equation). A
derivation rule also specifies the access class of the derived data. To reduce the potential
for inference, however, the data elements that are inputs to the derivation must be
examined to determine whether one or more of these elements are at the level of the
derived data. If this is the case, no inference problem exists. If, however, all the elements
are at a lower level than the derived data, then one or more of the derivation inputs must
be promoted to a higher classification level [Denn87a].
The use of classification constraints to counter inference, beyond the protections provided
by the view, requires additional computation. Thuraisingham and Ford [ThurFord95]
discuss one way that constraint processing can be implemented. In their model,
constraints are processed in three phases. Some constraints are processed during design
(these may be updated later), others are processed when the database is queried to
authorize access and counter inference, and many are processed during the update phase.
Their strategy relies on two inference engines, one for query processing and one for
update processing. Essentially, the inference engines are middlemen, which operate
between the DBMS and the interface (see figure 1). According to Thuraisingham and
Ford, the key to this strategy is the belief that most inferential attacks will occur as a
result of summarizing a series of queries (for example, a statistical inference could be
made by using a string of queries as a sample) or by interpreting the state change of
certain variables after an update.
The two inference engines work by evaluating the current task according to a set of rules
and determining a course of action. The inference engine for updates dynamically revises
the security constraints of the database as the security conditions of the organization
change and as the security characteristics of the data stored in the database change. The
inference engine for query processing evaluates each entity requested in the query, all the
data released in a specific period that is at the security level of the current query, and
relevant data available externally at the same security level. This is called the knowledge
base. The processor evaluates the potential inferences from the union of the knowledge
base and the query’s potential response. If the user’s security level dominates the security
levels of all of the potential inferences, the response is allowed [ThurFord95].
The integrity constraints in the relational model can be divided into two categories:
Implicit constraints, which include domain, relational, and referential constraints, enforce the rules of the relational model.
Explicit constraints enforce the rules of the organization served by the DBMS. As such,
explicit constraints are one of the two key elements (along with views) of security
protection in the relational model [BellGris92].
Typically, explicit constraints are implemented using the SQL ASSERT or TRIGGER
commands. ASSERT statements are used to prevent an integrity violation. Therefore,
they are applied before an update. The TRIGGER is part of a response activation
mechanism. If a problem with the existing database is detected (for example, an error is
detected after a parity check), then a predefined action is initiated [BellGris92]. More
complicated explicit constraints like multilevel classification constraints require
additional programming with a 3GL. This is the motivation for the constraint processor.
So, SQL and, consequently, the relational model alone cannot protect the database from a determined inferential attack.
Object-oriented Databases
While it is not the intent of this section to present a detailed description of the object-oriented model, the reader may be unfamiliar with the elements of an object-oriented database. For this reason, we will take a brief look at the object-oriented model's basic structure. For a more detailed discussion, the interested reader should see [Bert92, Stein94, or Sud95].
(1) Object class: This variable keeps a record of the parent class that defines the object.
(3) Data stores: These variables store data in much the same way that attributes store data
in a relational tuple [MilLun92].
Methods are the actions that can be performed by the object and the actions that can be
performed on the data stored in the object variables. Methods perform two basic
functions: They communicate with other objects and they perform reads and updates on
the data in the object. Methods communicate with other objects by sending messages.
When a message is sent to an object, the receiving object creates a subject. Subjects
execute methods; objects do not. If the subject has suitable clearance, the message will
cause the subject to execute a method in the receiving object. Often, when the action at
the called object ends, the subject will execute a method that sends a message to the
calling object indicating that the action has ended [MilLun92].
Methods perform all reading and writing of the data in an object. For this reason, we say
that the data is encapsulated in the object. This is one of the important differences
between object-oriented and relational databases [MilLun92]. All control for access,
modification, and integrity start at the object level. For example, if no method exists for
updating a particular object's variable, then the value of that variable is constant. Any
change in this condition must be made at the object level.
Access Controls
As with the relational model, access is controlled by classifying elements of the database.
The basic element of this classification is the object. Access permission is granted if the
user has sufficient security clearance to access the methods of an object. Millen and Lunt
[MilLun92] describe a security model that effectively explains the access control
concepts in the object-oriented model. Their model is based on six security properties:
Property 1 (Hierarchy Property). The level of an object must dominate that of its class
object.
Property 2 (Subject Level Property). The security level of a subject dominates the level
of the invoking subject and it also dominates the level of the home object.
Property 3 (Object Locality Property). A subject can execute methods or read or write
variables only in its home object.
Property 4 (*-Property). A subject may write into its home object only if its security level is equal to that of the object.
Property 5 (Return value property) A subject can send a return value to its invoking
subject only if it is at the same security level as the invoking subject.
Property 1 ensures that the object that inherits properties from its parent class has at least
the same classification level as the parent class. If this were not enforced, then users
could gain access to methods and data for which they do not have sufficient clearance.
Property 2 ensures that the subject created by the receiving object has sufficient clearance
to execute any action from that object. Hence, the classification level given to the subject
must be equal to at least the highest level of the entities involved in the action.
Property 4 states that the subject must have sufficient clearance to update data in its home
object. If the invoking subject does not have as high a classification as the called object's
subject, an update is prohibited.
Property 5 ensures that if the invoking subject from the calling object does not have
sufficient clearance, the subject in the called object will not return a value.
The object-oriented model and the relational model minimize the potential for inference
in a similar manner. Remaining consistent with encapsulation, the classification
constraints are executed as methods. If a potential inference problem exists, access to a
particular object is prohibited [MilLun92].
Integrity
As with classification constraints, integrity constraints are also executed at the object
level [MilLun92]. These constraints are similar to the explicit constraints used in the
relational model. The difference is in execution. An object-oriented database maintains
integrity before and after an update by executing constraint checking methods on the
affected objects. As we saw earlier, a relational DBMS takes a more global
approach.
One of the benefits of encapsulation is that subjects from remote objects do not have
access to a called object's data. This is a real advantage that is not present in the
relational DBMS. Herbert [Her94] notes that an object oriented system derives a
significant benefit to database integrity from encapsulation. This benefit stems from
modularity. Since the objects are encapsulated, an object can be changed without
affecting the data in another object. So, the process that contaminated one element is less
likely to affect another element of the database.
Sudama [Sud95] states that there are many impediments to the successful implementation
of a distributed object-oriented database. The organization of the object-oriented
DDBMS is more difficult than the relational DDBMS. In a relational DDBMS, the role of
client and server is maintained. This makes the development of multilevel access controls
easier. Since the roles of client and server are not well defined in the object-oriented
model, control of system access and multilevel access is more difficult.
System access control for the object-oriented DDBMS can be handled at the host site in a
procedure similar to that described for the relational DDBMS. Since there is no clear
definition of client and server, however, the use of replicated multisite approval would be
impractical.
Multilevel access control problems arise when developing effective and efficient authorization algorithms for subjects that need to send messages to multiple objects across several geographically separate locations. According to Sudama [Sud95], there is as yet no widely accepted standard for such distributed object authorization.
Sudama [Sud95] notes that one standard does exist, called OSF DCE (Open Software
Foundation's Distributed Computing Environment), that is vendor-independent, but is
not strictly an object-oriented database standard.
While it does provide subject authorization, it treats the distributed object environment as
a client/server environment as is done in the relational model. He points out that this
problem may be corrected in the next release of the standard.
The major integrity concern in a distributed environment that is not a concern in the
centralized database is the distribution of individual objects. Recall that a RDBMS allows
the fragmentation of tables across sites in the system. It is less desirable to allow the
fragmentation of objects because this can violate encapsulation. For this reason,
fragmentation should be explicitly prohibited with an integrity constraint [Her94].
The DBA has a DBA account in the DBMS, which provides powerful capabilities that are not made available to regular database accounts and users:
1. Account creation – creates a new account and password for a user or a group of users.
2. Privilege granting – permits the DBA to grant certain privileges to certain accounts.
3. Privilege revocation – permits the DBA to revoke certain privileges that were previously given to certain accounts.
The DBA is fully responsible for the overall security of the system.
Privileges can be assigned at two levels:
1. The account level – the DBA specifies the particular privileges that each account holds independently of the relations in the database (e.g., CREATE TABLE, CREATE VIEW, and DROP privileges).
2. The relation level – controls the privilege to access each individual relation or view in the database (generally described by the access matrix model, where the rows are subjects – users, accounts, programs – and the columns are objects – relations, records, columns, views, operations).
The discretionary access control technique of granting and revoking privileges is an all-or-nothing method.
The need for multilevel security exists in government, industry, and corporate applications.
Typical security classes are top secret (TS), secret (S), confidential (C) and unclassified (U), where TS > S > C > U.
Two restrictions are enforced on data access based on the subject/object (S/O) classifications:
1. A subject S is not allowed read access to an object O unless class(S) >= class(O).
2. A subject S is not allowed to write an object O unless class(S) <= class(O).
o Deadlock situation
It is a situation in which two transactions each wait for the other to complete its operation and release the lock on a particular item.
o Starvation
This occurs when a specific transaction consistently waits or is restarted and never gets a chance to proceed further.
o Time Stamp
A timestamp is a unique identifier assigned by the DBMS to each transaction, typically based on the order in which transactions start, and is used to order transactions.
o Shadow Paging
When a transaction begins executing, the current directory, whose entries point to the most recent or current database pages on disk, is copied into a directory known as the shadow directory.
o The account level – the DBA specifies the particular privileges that each account holds independently of the relations in the database (e.g., CREATE TABLE, CREATE VIEW, and DROP privileges).
o The relation level – controls the privilege to access each individual relation or view in the database (generally described by the access matrix model, where the rows are subjects – users, accounts, programs – and the columns are objects – relations, records, columns, views, operations).
4.5 Summary
Concurrency control provides isolation among conflicting transactions in a database management system.
In multiversion 2PL, read and write operations from conflicting transactions can be processed concurrently. This improves concurrency, but it may delay transaction commit because certify locks must be obtained on all the items a transaction has written. It avoids cascading aborts but, as in the strict two-phase locking scheme, conflicting transactions may become deadlocked.
The degree of concurrency is low for coarse granularity and high for fine granularity.
When concurrent execution is permitted, the recovery process depends on the protocols
used for concurrency control.
Transaction table: Contains an entry for each active transaction, with information such
as transaction ID, transaction status and the LSN of the most recent log record for the
transaction.
Dirty Page table: Contains an entry for each dirty page in the buffer, which includes the
page ID and the LSN corresponding to the earliest update to that page.
A multidatabase system is a special distributed database system where one node may be running a relational database system under UNIX, another may be running an object-oriented system under Windows, and so on. A transaction may run in a distributed fashion at
multiple nodes. In this execution scenario the transaction commits only when all these
multiple nodes agree to commit individually the part of the transaction they were
executing.
The discretionary access control technique of granting and revoking privileges is an all-or-nothing method.
The recovery manager of a DBMS must also be equipped to handle more catastrophic failures such as disk crashes.
Statistical database security techniques must prevent the retrieval of individual data.
In some cases it may be possible to infer the values of individual tuples from a sequence of statistical queries.
[BellGris92] Bell, David and Jane Grisom, Distributed Database Systems. Wokingham, England: Addison-Wesley, 1992.
4.8 Assignment
Prepare an assignment on object-oriented database security.
[Inf96] “Illustra Object Relational Database Management System,” Informix white paper
from the Illustra Document Database, 1996.
4.11 Keywords
1. Concurrency control
2. Time Stamp
3. Shadow Paging
4. Immediate Update
5. Deferred update
6. Dirty Page Table
7. Deadlock
UNIT – V
Topics:
Enhanced Data Models for Advanced Applications
Temporal Database Concepts
Spatial and Multimedia Database
Distributed Databases and Client – Server Architecture
Data Fragmentation, Replication and Allocation Techniques
Types of Distributed Database Systems
Query Processing in Distributed Databases
Overview of Concurrency Control and Recovery in Distributed Databases
Client- Server Architecture and its Relationship to Distributed Databases
Distributed Databases in Oracle
Deductive Databases
Prolog/Datalog Notation – Interpretation of Rules
Basic Inference Mechanisms for Logic Programs
5.0 Introduction
Enhanced data models for advanced applications extend the data models we have already come across in database architecture. These extensions are used to support spatial and multimedia databases, which are widely used in modern information technology. Temporal databases deal with time and calendar-related information, while spatial databases deal with geographical information systems, weather data, maps, and so on.
5.1 Objective
The objective of this lesson is to learn and understand enhanced data models, active databases and triggers, the concepts of the distributed database management system and its security concerns, and the problem areas of that security. It also introduces Prolog/Datalog notation and deductive databases.
5.2 Contents
5.2.1 Enhanced Data Models for Advanced Applications
Triggers are executed when a specified condition occurs during an insert, delete, or update operation. Triggers are actions that fire automatically when these conditions arise, and they follow the event-condition-action (ECA) model.
Event: a database modification (insert, delete or update) that activates the trigger.
Condition: an optional condition that is evaluated when the event occurs; the action is carried out only if the condition is true.
Action: the operation to be executed when the event occurs and the condition holds.
Example: when a new employee is added to a department, modify the Total_sal of the department to include the new employee's salary. Logically this means that we will CREATE a TRIGGER; let us call the trigger Total_sal1.
Triggers may be row-level or statement-level:
Row-level: the clause FOR EACH ROW specifies a row-level trigger, whose action is executed once for each affected row.
Statement-level: if FOR EACH ROW is not specified, the trigger is statement-level; its action is executed once for the triggering statement, regardless of the number of affected rows.
Condition: any true/false condition; if no condition is specified, the action is executed whenever the event occurs. In an AFTER trigger the condition is evaluated after the triggering modification.
Action: one or more SQL statements (or a stored procedure) executed before or after the triggering modifications.
a. Triggers on Views
An active database allows users to make the following changes to triggers (rules)
Activate
Deactivate
Drop
The condition of a rule may be evaluated under different consideration modes:
Immediate consideration
Deferred consideration
Detached consideration
Immediate consideration: the condition is evaluated as part of the same transaction as the triggering event and, depending on the situation, may be evaluated before, after, or instead of the triggering event.
c. Triggers in SQL-99
In SQL-99, variables can be aliased inside the REFERENCING clause. An example trigger (with a representative action, assumed here, that keeps Department.Total_sal in step with salary updates) is:
Create Trigger Total_sal1
After Update of Salary on Employee
Referencing OLD ROW as O, NEW ROW as N
For Each Row
When (N.Dno is not null)
Update Department
Set Total_sal = Total_sal + N.Salary - O.Salary
Where Dno = N.Dno;
A calendar organizes time into different time units for convenience. Time can be represented along two dimensions: valid time, the time when the information is actually true in the modeled reality, and transaction time, the time when the information was stored in the database. Relations dealing with the two time dimensions:
a) Emp_VT (valid time)
Name, ENo., Salary, Dno, Supervisor_Name
Dept_VT (valid time)
DName, DNo., Total_sal, Manager_name
b) Emp_TT (transaction time)
Name, ENo., Salary, Dno, Supervisor_Name
Dept_TT (transaction time)
DName, DNo., Total_sal, Manager_name
A relation that records both dimensions is bitemporal (two-dimensional time).
Range query: finds objects of a particular type that are within a particular distance from a given location.
Nearest-neighbor query: finds the object of a particular type that is nearest to a given location, for example the object closest to an address in Pleasanton, CA.
Spatial joins or overlays: joins objects of two types based on some spatial condition (intersecting, overlapping, within a certain distance, etc.), for example objects located along a route such as Interstate 680.
c. R-trees
A technique for typical spatial queries that groups objects that are close in spatial proximity on the same leaf nodes of a tree-structured index.
d. Quad trees
Quad trees divide each subspace into equally sized areas.
In the years ahead multimedia information systems are expected to dominate our daily lives. Our houses will be wired for bandwidth to handle interactive multimedia applications, and high-definition TV/computer workstations will have access to a large number of databases, including digital libraries and image and video databases that will distribute vast amounts of multisource multimedia content.
e. Multimedia Databases
Images: Includes drawings, photographs, and so forth, encoded in standard formats such
as bitmap, JPEG, and MPEG. Compression is built into JPEG and MPEG.
Structured audio: A sequence of audio components comprising note, tone, duration, and
so forth.
Audio: sample data generated from aural recordings, stored as a string of bits in digitized form. Analog recordings are typically converted into digital form before storage.
The distributed database has all of the security concerns of a single-site database plus
several additional problem areas. We begin our investigation with a review of the security
elements common to all database systems and those issues specific to distributed systems.
A secure database must satisfy the following requirements (subject to the specific
priorities of the intended application):
1. It must have physical integrity (protection from data loss caused by power failures or
natural disaster),
2. It must have logical integrity (protection of the logical structure of the database),
3. It must be available when needed,
4. The system must have an audit system,
5. It must have elemental integrity (accurate data),
6. Access must be controlled to some degree depending on the sensitivity of the data,
7. A system must be in place to authenticate the users of the system, and
8. Sensitive data must be protected from inference [Pflee89].
The following discussion focuses on requirements 5-8 above, since these security areas
are directly affected by the choice of DBMS model. The key goal of these requirements is
to ensure that data stored in the DBMS is protected from unauthorized observation or
inference, unauthorized modification, and from inaccurate updates.
This can be accomplished by using access controls, concurrency controls, updates using
the two-phase commit procedure (this avoids integrity problems resulting from physical
failure of the database during a transaction), and inference reduction strategies. The level
of access restriction depends on the sensitivity of the data and the degree to which the
developer adheres to the principal of least privilege (access limited to only those items
required to carry out assigned tasks).
Typically, a lattice is maintained in the DBMS that stores the access privileges of
individual users. When a user logs on, the interface obtains the specific privileges for the
user.
(2) Acceptability of access: Only authorized users may view and or modify the data. In a
single level system, this is relatively easy to implement. If the user is unauthorized, the
operating system does not allow system access. On a multilevel system, access control is
considerably more difficult to implement, because the DBMS must enforce the
discretionary access privileges of the user.
(3) Assurance of authenticity: This includes the restriction of access to normal working
hours to help ensure that the registered user is genuine. It also includes a usage analysis
which is used to determine if the current use is consistent with the needs of the registered
user, thereby reducing the probability of a fishing expedition or an inference attack.
Concurrency controls help to ensure the integrity of the data. These controls regulate the
manner in which the data is used when more than one user is using the same data
element. These are particularly important in the effective management of a distributed
system, because, in many cases, no single DBMS controls data access. If effective
concurrency controls are not integrated into the distributed system, several problems can
arise.
Bell and Grisom [BellGris92] identify three possible sources of concurrency problems:
(1) Lost update: A successful update was inadvertently erased by another user.
(3) Unrepeatable read: Data retrieved is inaccurate because it was obtained during an
update. Each of these problems can be reduced or eliminated by implementing a suitable
locking scheme (only one subject has access to a given entity for the duration of the lock)
or a timestamp method (the subject with the earlier timestamp receives priority)
Special problems exist for a DBMS that has multilevel access. In a multilevel access
system, users are restricted from having complete data access. Policies restricting user
access to certain data elements may result from secrecy requirements, or they may result
from adherence to the principal of least privilege (a user only has access to relevant
information). Access policies for multilevel systems are typically referred to as either
open or closed. In an open system, all the data is considered unclassified unless access to
a particular data element is expressly forbidden. A closed system is just the opposite. In
this case, access to all data is prohibited unless the user has specific access privileges.
Classification of data elements is not a simple task. This is due, in part, to conflicting
goals. The first goal is to provide the database user with access to all non-sensitive data.
The second goal is to protect sensitive data from unauthorized observation or inference.
For example, the salaries for all of a given firm's employees may be considered non-
sensitive as long as the employee's names are not associated with the salaries. Legitimate
use can be made of this data. Summary statistics could be developed such as mean
executive salary and mean salary by gender. Yet an inference could be made from this
data. For example, it would be fairly easy to identify the salaries of the top executives.
Another problem is data security classification. There is no clear-cut way to classify data.
Millen and Lunt [MilLun92] demonstrate the complexity of the problem:
They state that when classifying a data element, there are three dimensions:
The first dimension is the easiest to handle. Access to a classified data item is simply
denied. The other two dimensions require more thought and more creative strategies. For
example, if an unauthorized user requests a data item whose existence is classified, how
does the system respond? A poorly planned response would allow the user to make
inferences about the data that would potentially compromise it.
a. Data Allocation
Four strategies regarding placement of data are:
• Centralized
• Partitioned (or Fragmented)
• Complete Replication
• Selective Replication
• Centralized: Consists of single database stored at one site with users distributed across
the network.
• Partitioned: Database partitioned into disjoint fragments, each fragment assigned to
one site.
• Complete Replication: Consists of maintaining complete copy of database at each site.
• Selective Replication: Combination of partitioning, replication, and centralization.
b. Data Fragmentation
A relation may be fragmented horizontally, vertically, or in a mixed (hybrid) fashion that combines both.
c. Available Network
The design of distributed database systems is strongly influenced by the type of
underlying WAN or LAN. Distributed database systems involving vertical partitioning
can run only on those networks that are connected continuously - at least during the hours
when the distributed database is operational.
Networks that are not continuously connected typically do not allow transactions across
sites, but may keep local copies of remote data and refresh the copies periodically. For
example, a nightly backup might be taken. For applications where consistency is not
critical, this is acceptable. This is also acceptable for systems involving horizontal
partitioning of the data.
d. Transaction Management
When vertical partitioning is used, special techniques must be applied to ensure that a transaction is applied to the two different databases in a way that does not cause inconsistency. This technique is called the two-phase commit.
e. Replication
Replication is the process of synchronizing several copies of the same records or record
fragments located at different sites and is used to increase the availability of data and to
speed query evaluation.
Replication raises several issues:
• The partitioning of the data, and how to select data field names and key values so as not to cause conflicts between sites.
• The timing of the replication (i.e., synchronous vs. asynchronous).
• Resolution of potentially conflicting updates at different sites, and ways of detecting them.
Note that suppliers feel that they can handle replication and especially an asynchronous
one (i.e., copying numerous records from one database to the other).
Unless such activities are labeled remote backups, it is recommended that the DBMS
vendor provide the replication software. The supplier should not attempt to write
replication code nor buy a third party product for such a purpose.
1. Homogeneous DDBMS:
• All sites use the same DBMS product (e.g., Oracle).
• Fairly easy to design and manage.
2. Heterogeneous DDBMS:
• Sites may run different DBMS products (e.g., Oracle and Ingres).
• Possibly different underlying data models (e.g., a relational DB and an OO database).
• Occurs when sites have implemented their own databases and integration is considered later.
Let us understand query processing with respect to the Employee and Department relations with no fragmentation. The processing of a distributed query can be based on various strategies for moving data or intermediate results between sites so as to minimize data transfer. Concurrency control and recovery in a distributed database must also deal with the following problems:
1. Failure of individual sites – when a site recovers, its local data must be brought up to date.
2. Failure of communication Link – the system must be able to deal with failure of
one or more communication links.
3. Distributed Commit – Problem is usually solved two-phase commit protocol.
4. Distributed Deadlock – Techniques for dealing with deadlocks must be followed.
Assume that you and I both read the same row from the Customer table, we both change
the data, and then we both try to write our new versions back to the database. Whose
changes should be saved? Yours? Mine? Neither? A combination? Similarly, if we
both work with the same Customer object stored in a shared object cache and try to make
changes to it, what should happen?
To understand how to implement concurrency control within your system you must start
by understanding the basics of collisions – you can either avoid them or detect and then
resolve them. The next step is to understand transactions, which are collections of
actions that potentially modify two or more entities. On modern software development
projects, concurrency control and transactions are not simply the domain of databases,
instead they are issues that are potentially pertinent to all of your architectural tiers.
a. Collisions
The referential integrity challenges discussed under implementing referential integrity and shared business logic result from an object schema being mapped to a data schema, which creates cross-schema referential integrity problems. With respect to collisions things are a little simpler: we only need to worry about ensuring the consistency of entities within the system of record. The system of record is the
location where the official version of an entity is located. This is often data stored within
a relational database although other representations, such as an XML structure or an
object, are also viable.
A collision is said to occur when two activities, which may or may not be full-fledged
transactions, attempt to change entities within a system of record. There are three fundamental ways in which two activities can interfere with one another:
1. Dirty read. Activity 1 (A1) reads an entity from the system of record and then
updates the system of record but does not commit the change (for example, the
change hasn’t been finalized). Activity 2 (A2) reads the entity, unknowingly
making a copy of the uncommitted version. A1 rolls back (aborts) the changes,
restoring the entity to the original state that A1 found it in. A2 now has a version
of the entity that was never committed and therefore is not considered to have
actually existed.
2. Non-repeatable read. A1 reads an entity from the system of record, making a
copy of it. A2 deletes the entity from the system of record. A1 now has a copy of
an entity that does not officially exist.
3. Phantom read. A1 retrieves a collection of entities from the system of record,
making copies of them, based on some sort of search criteria such as “all
customers with first name Bill.”A2 then creates new entities, which would have
met the search criteria (for example, inserts “Bill Klassen” into the database),
saving them to the system of record. If A1 reapplies the search criteria it gets a
different result set.
b. Locking Strategies
So what can you do? First, you can take a pessimistic locking approach that avoids
collisions but reduces system performance. Second, you can use an optimistic locking
strategy that enables you to detect collisions so you can resolve them. Third, you can
take an overly optimistic locking strategy that ignores the issue completely.
Pessimistic locking: is an approach where an entity is locked in the database for the
entire time that it is in application memory (often in the form of an object). A lock either
limits or prevents other users from working with the entity in the database.
Optimistic locking: is an approach in which you accept that collisions will occasionally occur and, rather than preventing them, detect them (for example, by checking a version number or timestamp when an entity is written back to the system of record) and then resolve them.
Overly Optimistic Locking: With the strategy you neither try to avoid nor detect
collisions, assuming that they will never occur. This strategy is appropriate for single
user systems, systems where the system of record is guaranteed to be accessed by only
one user or system process at a time, or read-only tables. These situations do occur. It is
important to recognize that this strategy is completely inappropriate for multi-user
systems.
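A small Python sketch of the optimistic strategy: each record carries a version number, and an update succeeds only if the version that was read is still the current one. The store, key, and column names here are hypothetical.

# Optimistic collision detection with a version column.
def optimistic_update(store, key, expected_version, new_data):
    row = store[key]
    if row["version"] != expected_version:
        return "collision detected - reread and resolve"
    row.update(new_data)
    row["version"] += 1        # later writers still holding the old version will fail
    return "updated"

# Example: two users read version 1; the second write is rejected as a collision.
customers = {42: {"name": "Bill", "version": 1}}
optimistic_update(customers, 42, 1, {"name": "William"})   # -> "updated"
optimistic_update(customers, 42, 1, {"name": "Billy"})     # -> collision detected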
The term client/server was first used in the 1980s in reference to personal computers
(PCs) on a network. The actual client/server model started gaining acceptance in the late
1980s. The client/server software architecture is a versatile, message-based and modular
infrastructure that is intended to improve usability, flexibility, interoperability, and
scalability as compared to centralized, mainframe, time sharing computing.
Three tier architectures. The three tier architecture (see Three Tier Software
Architectures) (also referred to as the multi-tier architecture) emerged to overcome the
limitations of the two tier architecture. In the three tier architecture, a middle tier was
added between the user system interface client environment and the database
management server environment. There are a variety of ways of implementing this
middle tier, such as transaction processing monitors, message servers, or application
servers. The middle tier can perform queuing, application execution, and database
staging. For example, if the middle tier provides queuing, the client can deliver its request
to the middle layer and disengage because the middle tier will access the data and return
the answer to the client. In addition the middle layer adds scheduling and prioritization
Page 174
Advanced RDBMS
for work in progress. The three tier client/server architecture has been shown to improve
performance for groups with a large number of users (in the thousands) and improves
flexibility when compared to the two tier approach. Flexibility in partitioning can be a
simple as "dragging and dropping" application code modules onto different computers in
some three tier architectures. A limitation with three tier architectures is that the
development environment is reportedly more difficult to use than the visually-oriented
development of two tier applications.
Three tier architecture with transaction processing monitor technology. The most
basic type of three tier architecture has a middle layer consisting of Transaction
Processing (TP) monitor technology (see Transaction Processing Monitor Technology).
The TP monitor technology is a type of message queuing, transaction scheduling, and
prioritization service where the client connects to the TP monitor (middle tier) instead of
the database server. The transaction is accepted by the monitor, which queues it and then
takes responsibility for managing it to completion, thus freeing up the client. When the
capability is provided by third party middleware vendors it is referred to as "TP Heavy"
because it can service thousands of users.
Three tier with message server. Messaging is another way to implement three tier
architectures. Messages are prioritized and processed asynchronously. Messages consist
of headers that contain priority information, and the address and identification number.
The message server connects to the relational DBMS and other data sources.
Three tier with an application server. The three tier application server architecture
allocates the main body of an application to run on a shared host rather than in the user
system interface client environment. The application server does not drive the GUIs;
rather it shares business logic, computations, and a data retrieval engine.
In developing a distributed database, one of the first questions to answer is where to grant
system access.
Bell and Grisom [BellGris92] outline two strategies:
(1) Users are granted system access at their home site.
(2) Users are granted system access at the remote site.
The first case is easier to handle. It is no more difficult to implement than a centralized
access strategy. Bell and Grisom point out that the success of this strategy depends on
reliable communication between the different sites (the remote site must receive all of the
necessary clearance information). Since many different sites can grant access, the
probability of unauthorized access increases. Once one site has been compromised, the
entire system is compromised. If each site maintains access control for all users, the
impact of the compromise of a single site is reduced (provided that the intrusion is not the
result of a stolen password).
The second strategy, while perhaps more secure, has several disadvantages. Probably the
most glaring is the additional processing overhead required, particularly if the given
operation requires the participation of several sites. Furthermore, the maintenance of
replicated clearance tables is computationally expensive and more prone to error. Finally,
the replication of passwords, even though they're encrypted, increases the risk of theft.
A third possibility offered by Woo and Lam [WooLam92] centralizes the granting of
access privileges at nodes called policy servers. These servers are arranged in a network.
When a policy server receives a request for access, all members of the network determine
whether to authorize the access of the user. Woo and Lam believe that separating the
approval system from the application interface reduces the probability of compromise.
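The three strategies can be contrasted in a few lines of illustrative Python; the data structures, function names and the rule that every policy server must agree are assumptions made for the sketch, not details taken from Bell and Grisom or Woo and Lam.

```python
# Illustrative comparison of the access-granting strategies discussed above.
# All data structures and names are hypothetical.

HOME_SITE_CLEARANCE = {"alice": {"read"}, "bob": {"read", "write"}}

def grant_at_home_site(user, operation):
    """Strategy 1: the user's home site holds the clearance information."""
    return operation in HOME_SITE_CLEARANCE.get(user, set())

REMOTE_SITE_CLEARANCE = {          # Strategy 2: every site replicates clearances
    "site_a": {"alice": {"read"}},
    "site_b": {"alice": {"read"}},
}

def grant_at_remote_site(site, user, operation):
    return operation in REMOTE_SITE_CLEARANCE.get(site, {}).get(user, set())

POLICY_SERVERS = [{"alice": {"read"}}, {"alice": {"read"}}]   # Woo and Lam style

def grant_via_policy_servers(user, operation):
    """All members of the policy-server network must authorize the access."""
    return all(operation in server.get(user, set()) for server in POLICY_SERVERS)

if __name__ == "__main__":
    print(grant_at_home_site("alice", "read"))               # True
    print(grant_at_remote_site("site_b", "alice", "write"))  # False
    print(grant_via_policy_servers("alice", "read"))         # True
```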
a. Integrity
Bell and Grisom explain that local integrity constraints are bound to differ in a
heterogeneous distributed database. The differences stem from differences in the individual
organizations. Global integrity constraints, on the other hand, are separated from the individual
organizations. It may not always be practical to change the organizational structure in
order to make the distributed database consistent. Ultimately, this will lead to
inconsistencies between local and global constraints. Conflict resolution depends on the
level of central control. If there is strong global control, the global integrity constraints
will take precedence. If central control is weak, local integrity constraints will.
Independent failures are less likely to disrupt other nodes of the distributed database. No
single database failure need halt all distributed operations or be a performance bottleneck.
A data dictionary exists for each local database; a global catalog is not necessary to access
local data.
The database server is the Oracle software managing a database and a client is an
application that requests information from a server. Each computer in a network is a node
that can host one or more databases. Each node in a distributed database system can act
as a client, a server or both depending on the situation.
The host for the HQ database is acting as a database server when a statement is issued
against its local data, but is acting as a client when it issues a statement against remote
data
With the ever growing technology of distributed data processing and database
management, the growth of client-server technology is very promising.
To understand the deductive database system well, some basic concepts from
mathematical logic are needed.
- term
- n-ary predicate
- literal
- (well-formed) formula
- clause and Horn-clause
- facts
- logic program
- term
A term is a constant, a variable or an expression of the form f(t1, t2, ..., tn),
where t1, t2, ..., tn are terms and f is a function symbol.
- Example: a, b, c, f(a, b), g(a, f(a, b)), x, y, g(x, y)
- n-ary predicate
An n-ary predicate symbol is a symbol p appearing in an expression of the form
p(t1, t2, ..., tn), called an atom, where t1, t2, ..., tn are terms. p(t1, t2, ..., tn) can only
evaluate to true or false.
-Example: p(a, b), q(a, f(a, b)), p(x, y)
- literal
A literal is either an atom or its negation.
-Example: p(a, f(a, b)), ¬p(a, f(a, b))
- (well-formed) formula
A (well-formed) formula is built from atoms using the logical connectives ¬ (not), ∧ (and),
∨ (or) and → (implies), together with the quantifiers ∀ and ∃.
-Example: ∀X (p(X) → q(X))
- clause
A clause is an expression of the following form:
¬A1 ∨ ¬A2 ∨ ... ∨ ¬An ∨ B1 ∨ ... ∨ Bm
which can equivalently be written with the consequent and antecedent separated:
B1 ∨ ... ∨ Bm ← A1 ∧ ... ∧ An
(consequent ← antecedent)
or, in rule notation,
B1, ..., Bm ← A1, ..., An
The equivalence follows from the fact that A ⇒ B is the same as ¬A ∨ B:
A  B  A ⇒ B  ¬A ∨ B
1  1    1       1
0  1    1       1
1  0    0       0
0  0    1       1
- Horn clause
A Horn clause is a clause with the head containing only
one positive atom.
B ← A1, ..., An
- fact
A fact is a Horn clause with an empty body, that is, a ground atom asserted to be true.
-Example: supervise(franklin, john)
- logic program
A logic program is a set of Horn clauses.
Facts:
supervise(franklin, john),
supervise(franklin, ramesh),
supervise(franklin, joyce)
supervise(james, franklin),
supervise(jennifer, alicia),
supervise(jennifer, ahmad),
supervise(james, jennifer).
Rules:
superior(X, Y) ← supervise(X, Y),
superior(X, Y) ← supervise(X, Z), superior(Z, Y),
subordinate(X, Y) ← superior(Y, X).
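For the evaluation sketches later in this unit it helps to have a concrete, if hypothetical, encoding of such a program; in the sketch below an atom is a tuple whose first element is the predicate name, variables are capitalized strings, and a rule is a (head, body) pair.

```python
# One possible in-memory encoding (hypothetical) of the supervise/superior program.
# An atom is a tuple (predicate, arg1, ..., argn); variables are capitalized
# strings, constants are lower-case strings; a rule is a (head, body) pair.

FACTS = {
    ("supervise", "franklin", "john"), ("supervise", "franklin", "ramesh"),
    ("supervise", "franklin", "joyce"), ("supervise", "james", "franklin"),
    ("supervise", "jennifer", "alicia"), ("supervise", "jennifer", "ahmad"),
    ("supervise", "james", "jennifer"),
}

RULES = [  # head <- body1, body2, ...
    (("superior", "X", "Y"), [("supervise", "X", "Y")]),
    (("superior", "X", "Y"), [("supervise", "X", "Z"), ("superior", "Z", "Y")]),
    (("subordinate", "X", "Y"), [("superior", "Y", "X")]),
]

def is_variable(term):
    """A term written with a leading capital letter is a variable."""
    return isinstance(term, str) and term[:1].isupper()

print(is_variable("X"), is_variable("james"))   # True False
```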
There are two main alternatives for interpreting the theoretical meaning of rules:
proof theoretic, and
model theoretic interpretation.
In the proof-theoretic interpretation:
1. The facts and rules are regarded as axioms: ground axioms (the facts) contain no variables,
while the rules are called deductive axioms.
2. The deductive axioms are used to construct proofs that derive new facts from existing
facts.
Example:
1. superior(X, Y) ← supervise(X, Y). (rule 1)
2. superior(X, Y) ← supervise(X, Z), superior(Z, Y). (rule 2)
1. Given a finite or an infinite domain of constant values, assign to each predicate in the
program every possible combination of values as arguments.
2. The set of all such instantiated (ground) predicates is called the Herbrand base.
3. An assignment of truth values (true or false) to the predicates in the Herbrand base is
called an interpretation.
4. In the Herbrand base, each instantiated predicate evaluates to true or false in terms of
the given facts and rules.
5. An interpretation is called a model for a specific set of rules and the corresponding
facts if those rules are always true under that interpretation.
6. A model is a minimal model for a set of rules and facts if we cannot change any
element in the model from true to false and still get a model for these rules and facts.
Example:
Interpretation - model - minimal model
known facts:
supervise(franklin, john), supervise(franklin, ramesh),
supervise(franklin, joyce), supervise(james, franklin),
supervise(jennifer, alicia), supervise(jennifer, ahmad),
supervise(james, jennifer).
derived facts:
superior(franklin, john), superior(franklin, ramesh), superior(franklin, joyce),
superior(james, franklin), superior(jennifer, alicia), superior(jennifer, ahmad),
superior(james, jennifer), superior(james, john), superior(james, ramesh),
superior(james, joyce), superior(james, alicia), superior(james, ahmad).
The above interpretation is also a model for the rules (1) and (2) since each of them
always evaluates to true under the interpretation. For example,
superior(X, Y) ← supervise(X, Y)
The model is also the minimal model for rules (1) and (2) and the corresponding facts,
since eliminating any element from the model will make some facts or instantiated rules
evaluate to false.
For example,
eliminating supervise(franklin, john) from the model will make this fact no longer
true under the interpretation;
eliminating superior(james, ramesh) will make the following instantiated rule no longer true
under the interpretation:
superior(james, ramesh) ← supervise(james, franklin), superior(franklin, ramesh)
- Inference mechanism
a. Bottom-up mechanism
1. The inference engine starts with the facts and applies the rules to generate new facts.
That is, the inference moves forward from the facts toward the goal.
2. As facts are generated, they are checked against the query predicate goal for a match.
Example
query goal: superior(james, Y)?
rules and facts are given as above.
1. Check whether any of the existing facts directly matches the query.
2. Apply the first rule to the existing facts to generate new facts.
3. Apply the second rule to the existing facts to generate new facts.
4. As each fact is generated, it is checked for a match of the query goal.
Example:
known facts:
supervise(franklin, john), supervise(franklin, ramesh),
supervise(franklin, joyce), supervise(james, franklin),
supervise(jennifer, alicia), supervise(jennifer, ahmad),
supervise(james, jennifer).
For all other possible (X, Y) combinations supervise(X, Y) is false.
domain = {james, franklin, john, ramesh, joyce, jennifer, alicia, ahmad}
superior(james, Y)?
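A minimal, hypothetical forward-chaining implementation of this bottom-up mechanism is sketched below; it repeatedly applies the two superior rules to the supervise facts until no new facts are generated and then filters the result against the goal superior(james, Y). The helper names (match, apply_rule, bottom_up) are invented for the sketch.

```python
# A small bottom-up (forward-chaining) inference sketch for the example above.
# Atoms are tuples; variables are capitalized strings. This is illustrative
# code, not the algorithm of any particular deductive database system.

FACTS = {
    ("supervise", "franklin", "john"), ("supervise", "franklin", "ramesh"),
    ("supervise", "franklin", "joyce"), ("supervise", "james", "franklin"),
    ("supervise", "jennifer", "alicia"), ("supervise", "jennifer", "ahmad"),
    ("supervise", "james", "jennifer"),
}
RULES = [
    (("superior", "X", "Y"), [("supervise", "X", "Y")]),
    (("superior", "X", "Y"), [("supervise", "X", "Z"), ("superior", "Z", "Y")]),
]

def is_var(t):
    return t[:1].isupper()

def match(pattern, fact, binding):
    """Extend binding so that pattern matches fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding

def apply_rule(head, body, facts):
    """Yield all head instances derivable from the given facts."""
    bindings = [{}]
    for atom in body:
        bindings = [b2 for b in bindings for f in facts
                    if (b2 := match(atom, f, b)) is not None]
    for b in bindings:
        yield (head[0],) + tuple(b.get(t, t) for t in head[1:])

def bottom_up(facts, rules):
    derived = set(facts)
    while True:                                   # iterate until no new facts
        new = {h for head, body in rules for h in apply_rule(head, body, derived)} - derived
        if not new:
            return derived
        derived |= new

if __name__ == "__main__":
    for fact in sorted(bottom_up(FACTS, RULES)):
        if fact[0] == "superior" and fact[1] == "james":   # goal superior(james, Y)?
            print(fact)
```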
b. Top-down mechanism
1. The inference engine starts with the query goal and attempts to find matches to the
variables that lead to valid facts in the database. That is, the inference moves backward
from the intended goal to determine facts that would satisfy the goal.
2. In the course of this process, the rules are used to generate subgoals; the matching of
these subgoals leads to the match of the intended goal.
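For comparison, a small backward-chaining sketch is shown below: it starts from the query goal, uses the rules to generate subgoals, and matches subgoals against the stored facts. It is a didactic illustration under simplifying assumptions (depth-first search, no loop detection), not a full SLD resolution engine.

```python
# A didactic top-down (backward-chaining) sketch for the same program.
# Starting from the query goal, rules generate subgoals whose matches
# against stored facts lead to answers for the original goal.

FACTS = {
    ("supervise", "franklin", "john"), ("supervise", "franklin", "ramesh"),
    ("supervise", "franklin", "joyce"), ("supervise", "james", "franklin"),
    ("supervise", "jennifer", "alicia"), ("supervise", "jennifer", "ahmad"),
    ("supervise", "james", "jennifer"),
}
RULES = [
    (("superior", "X", "Y"), [("supervise", "X", "Y")]),
    (("superior", "X", "Y"), [("supervise", "X", "Z"), ("superior", "Z", "Y")]),
]

def is_var(t):
    return t[:1].isupper()

def walk(t, b):
    while is_var(t) and t in b:
        t = b[t]
    return t

def unify(a1, a2, b):
    """Unify two atoms under binding b; return the extended binding or None."""
    if a1[0] != a2[0] or len(a1) != len(a2):
        return None
    b = dict(b)
    for t1, t2 in zip(a1[1:], a2[1:]):
        t1, t2 = walk(t1, b), walk(t2, b)
        if t1 == t2:
            continue
        if is_var(t1):
            b[t1] = t2
        elif is_var(t2):
            b[t2] = t1
        else:
            return None
    return b

def rename(atom, suffix):
    """Rename rule variables so different expansions do not clash."""
    return (atom[0],) + tuple(t + suffix if is_var(t) else t for t in atom[1:])

def solve(goals, binding, depth=0):
    """Depth-first resolution: try facts first, then expand with rules."""
    if not goals:
        yield binding
        return
    goal, rest = goals[0], goals[1:]
    for fact in FACTS:
        b = unify(goal, fact, binding)
        if b is not None:
            yield from solve(rest, b, depth)
    for head, body in RULES:
        head_r = rename(head, f"_{depth}")
        body_r = [rename(a, f"_{depth}") for a in body]
        b = unify(goal, head_r, binding)
        if b is not None:
            yield from solve(body_r + rest, b, depth + 1)

if __name__ == "__main__":
    answers = {walk("Y", b) for b in solve([("superior", "james", "Y")], {})}
    print(sorted(answers))      # everyone below james in the hierarchy
```

Because the recursive superior rule grounds Z through supervise before recursing, this particular program terminates; in general, as discussed later in this unit, purely top-down evaluation is prone to recursive loops.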
In this notation:
- a predicate has a name and a fixed number of arguments;
- a rule has a head and a body;
- a query has variable arguments to answer the question;
- the symbol ← is read as "if and only if".
SUPERIOR(james,Y)?
supervise(james, jennifer).
Interpretation of Rules
Proof-theoretic interpretation: the facts and rules are regarded as ground and deductive axioms,
respectively. Ground axioms contain no variables; the deductive axioms (the rules) can be used
to construct new facts from existing facts. This process is known as theorem proving, or
proving a new fact.
Model-theoretic interpretation: an assignment of truth values to all possible instantiated
predicates over a given domain is called an interpretation.
a. Model: an interpretation is a model for a set of rules if, whenever all the predicates in the
body of a rule are true under the interpretation, the predicate at the head of the rule is also true.
b. Minimal model: a model in which we cannot change any fact from true to false and still get a
model for these rules.
Query Languages
In general, query languages are formal languages to retrieve data from a database.
Standardized languages already exist to retrieve information from different types of
databases such as Structured Query Language (SQL) for relational databases and Object
Query Language (OQL) and SQL3 for object databases.
Semi-structured query languages such as XML-QL [3] operate on the document-level
structure.
Logic programs consist of facts and rules where valid inference rules are used to
determine all the facts that apply within a given model.
With RDF, the most suitable approach is to focus on the underlying data model. Even
though XML-QL could be used to query RDF descriptions in their XML encoded form, a
single RDF data model could not be correctly determined with a single XML-QL query
due to the fact that RDF allows several XML syntax encodings for the same data model.
RDF provides the basis for structuring the data present in the web in a consistent and
accurate way. However, RDF is only the first step towards the construction of what Tim
Berners-Lee calls the "web of knowledge", a World Wide Web where data is structured,
and users can fully benefit from this structure when accessing information on the web. RDF
only provides the "basic vocabulary" in which data can be expressed and structured.
Then, the whole problem of accessing and managing these structured data arises.
Metalog provides a "logical" view of metadata present on the web. The Metalog approach
is composed of several components.
In the first component, a particular data semantics is established. Metalog provides a way
to express logical relationships like "and", "or" and so on, and to build up complex
inference rules that encode logical reasoning. This "semantic layer" builds on top of RDF
using a so-called RDF schema.
The second component consists of a "logical interpretation" of RDF data (optionally
enriched with the semantic schema) into logic programming. This way, the understood
semantics of RDF is unfolded into its logical components (a logic program, indeed).
This means that any reasoning on RDF data can be performed by acting upon the
corresponding logical view, the logic program, providing a neat and powerful way to
reason about data.
The third component is a language interface for writing structured data and reasoning
rules. In principle, the first component already suffices: data and rules can be written
directly in RDF, using RDF syntax and the metalog schema. However, RDF syntax aims at being
more of an encoding language than a user-friendly one, and it is well recognised
in the RDF community and among vendors that the typical applications will provide
more user-friendly interfaces between the "raw RDF" code and the user.
Another important feature of the language, in this respect, is indeed that it can be used
just as an interface to RDF, without the metalog extensions. This way, users will be able
to access and structure metadata using RDF in a smooth and seamless way, using the
metalog language.
The first correspondence in Metalog is between the basic RDF data model and the
predicates in logic. The RDF data model consists of so-called statements. Statements are
triples where there is a subject (the "resource"), a predicate (the "property"), and an
object (the "literal"). Metalog views an RDF statement in the logical setting as just a
binary predicate involving the subject and the literal; in other words, an RDF statement is
seen in logic programming as a predicate of the form property(subject, literal).
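The concrete example referred to above is not reproduced in these notes, so the following tiny sketch only illustrates the shape of the mapping; the triple and the rendering function are invented for the illustration.

```python
# Hypothetical illustration of the Metalog view of an RDF statement:
# the triple (subject, property, object) becomes property(subject, object).

def rdf_statement_to_predicate(subject, prop, obj):
    """Render an RDF triple as a binary logic predicate (as a string)."""
    return f'{prop}("{subject}", "{obj}")'

if __name__ == "__main__":
    triple = ("http://www.w3.org/Consortium", "creator", "Tim Berners-Lee")
    print(rdf_statement_to_predicate(*triple))
    # creator("http://www.w3.org/Consortium", "Tim Berners-Lee")
```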
Once the basic correspondence between the RDF data model and predicates in logic is
established, the next step comes easily: we can extend RDF so that the mapping to
logic is able to take advantage of all of the logical relationships present in logical
systems: that is to say, beyond the ability of expressing static facts, we want the ability to
encode dynamic reasoning rules, as in logic programming.
In order to do so, we need at least the ability to express, within RDF, the basic logical
connectives (such as "and", "or" and "implies") used to build inference rules.
The metalog schema extends plain RDF with this "logical layer", enabling us to express
arbitrary logical relationships within RDF. In fact, the metalog schema provides more
accessories besides the aforementioned basic ones (like, for example, the "implies"
connector); in any case, so as not to weigh down the discussion, we do not go into further details on this
topic. What the reader should keep in mind is just that the Metalog schema provides the
"meta-logic" operators to reason with RDF statements.
Technically, this is quite easy to do: the metalog schema is just a schema as defined by
the RDF schema specification where, for example, "and" and "or" are subinstances of the
RDF Bag connector.
The mapping between "metalog RDF" and logical formulas is then completely natural:
for each RDF statement that does not use a metalog connector, there is a corresponding
logical predicate as defined before. Then, the metalog connectors are translated into the
corresponding logical connectors in the natural way (so, for instance, the metalog and
connector is mapped using logical conjunction, while the metalog or connector is mapped
using logical disjunction).
Note that the RDF metalog schema and the corresponding translation into logical
formulas is absolutely general. However, in practice, one also needs to be able to
process the resulting logical formulas in an effective way. In other words, while the RDF
metalog schema nicely extends RDF with the full power of first order predicate calculus,
thus increasing by far the expressibility of basic RDF, there is still the other,
computational, side of the coin: how to process and effectively reason with all these
logical inference rules.
It is well known that dealing with full first-order predicate calculus is, in general,
computationally infeasible. So, what we would like to have is a tractable subset of predicate
calculus; logic programming provides exactly such a subset.
The third level is then the actual syntax interface between the user and this "metalog
RDF" encoding, with the constraint that the expressibility of the language must fit within
the one provided by logic programming.
The metalog syntax has been explicitly designed with the purpose of being totally
natural-language based, trying to avoid any possible technicalities, and therefore making
the language extremely readable and self-descriptive.
The way metalog achieves this goal is by a careful use of upper/lower case, quotes, and
by allowing a rather liberal positioning of the keywords (an advanced parser then
disambiguates the keywords from each metalog program line).
Fact-based predicates are defined by listing all the combinations of values that make the
predicate true.
Rule-based predicates are defined to be the head of one or more Datalog rules. They
correspond to virtual relations whose contents can be inferred by the inference engine.
Example:
- safety of programs
- predicate dependency graph
Safety of programs
A rule is unsafe if one of the variables in the rule can range over an infinite domain of
values, and that variable is not limited to ranging over a finite predicate before it is
instantiated.
-Example:
The evaluation of these rules (no matter whether in bottom-up or in top-down fashion)
will never terminate.
A variable X is limited if
(1) it appears in a regular (not built-in) predicate in the body of the rule
(built-in predicates: <, >, ≤, ≥, =, ≠);
(2) it appears in a predicate of the form X = c or c = X or (c1 ≤ X and X ≤ c2), where c, c1
and c2 are constant values;
(3) it appears in a predicate of the form X = Y or Y = X in the rule body, where Y is a
limited variable;
(4) before it is instantiated, some other regular predicates containing it will have been
evaluated.
Condition of safety:
A rule is safe if all of its variables are limited; in particular, every variable appearing in the
head must be limited in the body of a rule whose head predicate is p.
Example:
superior(X, Y) ← supervise(X, Y),
superior(X, Y) ← supervise(X, Z), superior(Z, Y),
subordinate(X, Y) ← superior(Y, X),
supervisor(X, Y) ← employee(X), supervise(X, Y).
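A hedged sketch of such a safety check is given below; it implements only condition (1), treating a variable as limited when it occurs in a regular (non built-in) body predicate, which is enough to classify the rules above as safe and a rule such as big_salary(Y) ← Y > 60000 (a hypothetical example) as unsafe.

```python
# A minimal safety check, covering only condition (1) above: every variable
# of the rule must occur in a regular (non built-in) predicate of the body.
BUILT_INS = {"<", ">", "<=", ">=", "=", "!="}

def variables(atom):
    return {t for t in atom[1:] if t[:1].isupper()}

def is_safe(head, body):
    limited = set().union(*(variables(a) for a in body if a[0] not in BUILT_INS))
    all_vars = variables(head).union(*(variables(a) for a in body))
    return all_vars <= limited

if __name__ == "__main__":
    # superior(X, Y) <- supervise(X, Z), superior(Z, Y)   -- safe
    print(is_safe(("superior", "X", "Y"),
                  [("supervise", "X", "Z"), ("superior", "Z", "Y")]))   # True
    # big_salary(Y) <- Y > 60000                           -- unsafe
    print(is_safe(("big_salary", "Y"), [(">", "Y", "60000")]))          # False
```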
-If the dependency graph for a rule set has no cycles, the rule set is nonrecursive.
3. All the rules will be evaluated along the predicate dependency graph. At each step,
each rule will be evaluated in terms of step (2).
1. Locate a set of rules S whose head involves the predicate p. If there are no such rules,
then p is a fact-based predicate corresponding to some database relation Rp; in this case,
one of the following expressions is returned and the algorithm terminates.
(a) If all the arguments of p are distinct variables, the expression returned is simply Rp.
(b) If some arguments are constants or if the same variable appears in more than one
argument position, the expression returned is
SELECT<condition>(Rp), where the selection condition is a conjunction built as follows:
i. if a constant c appears as argument i, include a simple condition ($i = c) in the
conjunction;
ii. if the same variable appears in both argument positions j and k, include a condition
($j = $k) in the conjunction.
2. At this point, one or more rules Si, i = 1, 2, ..., n, n > 0 exist with predicate p as
their head. For each such rule Si, generate a relational expression as follows:
a. Apply the selection operation on the predicates in the body of each such rule, as discussed
in Step 1(b).
b. A natural join is constructed among the relations that correspond to the predicates in the
body of the rule Si over the common variables. Let the resulting relation from this join be Rs.
c. If any built-in predicate X θ Y was defined over the arguments X and Y, the result of the
join is subjected to an additional selection: SELECT X θ Y (Rs).
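As a worked illustration of Steps 1 and 2, the sketch below evaluates the non-recursive rule supervisor(X, Y) ← employee(X), supervise(X, Y) by joining the two fact-based relations on their common variable X; the relation contents are assumed purely for the example.

```python
# Worked illustration (with assumed relation contents) of Steps 1-2 for the
# non-recursive rule  supervisor(X, Y) <- employee(X), supervise(X, Y):
# each body predicate maps to its relation and the shared variable X becomes
# a natural-join condition.

EMPLOYEE  = {("james",), ("franklin",), ("jennifer",)}        # relation for employee(X)
SUPERVISE = {("james", "franklin"), ("james", "jennifer"),
             ("franklin", "john")}                             # relation for supervise(X, Y)

def natural_join(r1, r2, positions):
    """Join r1 and r2 on the given (position_in_r1, position_in_r2) pairs."""
    drop = {p2 for _, p2 in positions}
    return {t1 + tuple(v for i, v in enumerate(t2) if i not in drop)
            for t1 in r1 for t2 in r2
            if all(t1[p1] == t2[p2] for p1, p2 in positions)}

# The common variable X is column 0 of EMPLOYEE and column 0 of SUPERVISE.
SUPERVISOR = natural_join(EMPLOYEE, SUPERVISE, [(0, 0)])
print(sorted(SUPERVISOR))   # [('franklin', 'john'), ('james', 'franklin'), ('james', 'jennifer')]
```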
-If the dependency graph for a rule set has at least one cycle, the rule set is recursive.
- naive strategy
- semi-naive strategy
- stratified databases
- some terminology for recursive queries
- linearly recursive
- left linearly recursive
ancestor(X, Y) ← ancestor(X, Z), parent(Z, Y)
- right linearly recursive
ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y)
- non-linearly recursive
sg(X, Y) ← sg(X, Z), sibling(Z, W), sg(W, Y)
A Datalog equation is an equation obtained by replacing "←" and "," in a rule with "=" and the
corresponding relational operations (natural join within a rule body, union across the rules with
the same head), respectively.
a. naive strategy
for i = 1 to n do Ri := ∅;
repeat
  Con := true;
  for i = 1 to n do Si := Ri;
  for i = 1 to m do { Ri := Ei(S1, ..., Si, ..., Sn);
                      if Ri ≠ Si then { Con := false; Si := Ri; } }
until Con = true;
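A runnable version of the naive strategy, specialized to the single Datalog equation ancestor = parent ∪ (ancestor ⋈ parent), is shown below; it uses the parent facts from the example given in the semi-naive discussion that follows.

```python
# Runnable version of the naive strategy for the single recursive equation
#     ancestor = parent  UNION  (ancestor JOIN parent)
# using the parent facts from the example below.

PARENT = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
          ("alice", "pat"), ("derek", "frank")}

def compose(r, s):
    """{(x, y) | (x, z) in r and (z, y) in s} -- the join on the shared variable."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}

def naive_ancestor(parent):
    ancestor = set()                                   # R := empty set
    while True:
        previous = set(ancestor)                       # S := R
        ancestor = parent | compose(previous, parent)  # R := E(S)
        if ancestor == previous:                       # no change: Con = true
            return ancestor

print(sorted(naive_ancestor(PARENT)))
```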
b. semi-naive strategy
For a linearly recursive rule set, Di(k) can be substituted for Ri in the k-th iteration of
the naïve algorithm.
3.The result is obtained by the union of the newly obtained term Ri and that obtained in
the previous step.
for i = 1 to n do Ri := ∅;
for i = 1 to n do Di := ∅;
repeat
  Con := true;
  for i = 1 to n do { Di := Ei(D1, ..., Di, ..., Dn) - Ri;
                      Ri := Di ∪ Ri;
                      if Di ≠ ∅ then Con := false;
                    }
until Con = true;
Example:
Step 0: D0 = ∅, A0 = ∅;
Step 1: D1 = P = {(bert, alice), (bert, george), (alice, derek), (alice, pat), (derek, frank)}
        A1 = D1 ∪ A0 = {(bert, alice), (bert, george), (alice, derek), (alice, pat), (derek, frank)}
Step 2: D2 = {(bert, derek), (bert, pat), (alice, frank)}
        A2 = D2 ∪ A1 = {(bert, alice), (bert, george), (alice, derek), (alice, pat),
                        (derek, frank), (bert, derek), (bert, pat), (alice, frank)}
Example:
The advantage of the semi-naive method is that at each step a differential term
Di is used in each equation instead of the whole Ri. In this way, the time
complexity of a computation is decreased drastically.
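The same ancestor example evaluated semi-naively is sketched below: at each iteration only the differential D from the previous step is joined with parent, and the new tuples are added to the accumulated answer, mirroring Steps 0 to 2 above.

```python
# Semi-naive evaluation of the ancestor example above: only the differential
# D computed in the previous step is joined with parent at each iteration.

PARENT = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
          ("alice", "pat"), ("derek", "frank")}

def compose(r, s):
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}

def semi_naive_ancestor(parent):
    answer = set()                 # A0 := empty
    delta = set(parent)            # D1 := P
    while delta:
        answer |= delta                          # A(k) := D(k) UNION A(k-1)
        delta = compose(delta, parent) - answer  # D(k+1) := newly derived tuples only
    return answer

print(sorted(semi_naive_ancestor(PARENT)))
```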
c. The magic-set rule rewriting technique
For example, to evaluate the query sg(john, Z)? using the following rules:
a bottom-up method would generate all sg-tuples and then apply a selection operation to
obtain the answers.
d. Stratified databases
p(X) ← ¬q(X),
q(X) ← ¬p(X).
To avoid the recursion via negation, we introduce the concept of stratification, which is
defined by the use of a level l mapping.
level l mapping: assign each literal in the program an integer level such that, if
B ← A1, …, An
and Ai is positive, then l(Ai) ≤ l(B) for all i, 1 ≤ i ≤ n; if Ai is negative (¬Ai), then l(Ai) < l(B)
for all i, 1 ≤ i ≤ n.
If you can assign integers to all the literals in a program using a level mapping, then this
program is stratifiable.
p(X) ← ¬q(X),
q(X) ← ¬p(X).
In fact, we cannot find a level mapping for any program which contains recursion via
negation.
Evaluate the literals in the program from low level to the high level.
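A small sketch of a stratification test along these lines is shown below; it tries to assign levels by treating each positive dependency as level(body) ≤ level(head) and each negative dependency as level(body) < level(head), and reports failure when the levels keep growing, which is what happens for recursion through negation. The rule encoding (predicate names with a negation flag) is a simplification made for the sketch.

```python
# Sketch of a stratification test. A rule is (head_pred, [(body_pred, negated)]).
# Levels must satisfy level(body) <= level(head) for positive literals and
# level(body) <  level(head) for negative ones; if no assignment exists the
# program contains recursion through negation and is not stratifiable.

def stratify(rules):
    preds = {h for h, _ in rules} | {p for _, body in rules for p, _ in body}
    level = {p: 0 for p in preds}
    for _ in range(len(preds) + 1):          # enough passes if a solution exists
        changed = False
        for head, body in rules:
            for pred, negated in body:
                required = level[pred] + (1 if negated else 0)
                if level[head] < required:
                    level[head] = required
                    changed = True
        if not changed:
            return level                      # a valid level mapping
    return None                               # recursion via negation detected

if __name__ == "__main__":
    #  p(X) <- not q(X).   q(X) <- not p(X).   -- not stratifiable
    print(stratify([("p", [("q", True)]), ("q", [("p", True)])]))   # None
    #  r(X) <- s(X).       t(X) <- not r(X).   -- stratifiable
    print(stratify([("r", [("s", False)]), ("t", [("r", True)])]))  # e.g. {'s': 0, 'r': 0, 't': 1}
```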
- However, you cannot find any level mapping for the following
program:
Example:
We can find many level mappings for this program. The following are
just two of them:
Deductive database systems are database management systems whose query language
and (usually) storage structure are designed around a logical model of data. As relations
are naturally thought of as the "value" of a logical predicate, and relational languages
such as SQL are syntactic sugarings of a limited form of logical expression, it is easy to
see deductive database systems as an advanced form of relational systems.
The deductive systems do, however, share with the relational systems the important
property of being declarative, that is, of allowing the user to query or update by saying
what he or she wants, rather than how to perform the operation.
Another important thrust has been the problem of coping with negation or nonmonotonic
reasoning, where classical logic does not offer, through the conventional means of logical
deduction, an adequate definition of what some very natural logical statements "mean" to
the programmer.
A deductive database system is a database system which can make deductions (i.e.,
conclude additional rules or facts) based on rules and facts stored in the (deductive)
database. Deductive database systems:
A good example of a declarative language would be Prolog, but for databases Datalog is
used more often. Datalog is both a syntactic subset of Prolog and a database query
language – it is designed specifically for working with logic and databases. Deductive
databases are also known as logic databases, knowledge systems and inferential
databases. The problem domain of an expert system / deductive database is usually quite
narrow. Deductive databases are similar to expert systems - “traditional” expert systems
have assumed that all the facts and rules they need (their knowledge base) will be loaded
into main memory, whereas a deductive database uses a database (usually on disk
storage) as its knowledge base. Traditional expert systems have usually also taken their
facts and rules from a real expert in their problem domain, whereas deductive databases
find their knowledge inherent in the data. Deductive databases and expert systems are
mainly used for:
The first rule can be interpreted as saying that individuals X and Y are at the same
generation if they are related by the predicate flat, that is, if there is a tuple (X, Y) in the
relation for flat.
The second rule says that X and Y are also at the same generation if there are individuals
U and V such that:
1. X and U are related by the up relation;
2. U and V are at the same generation; and
3. V and Y are related by the down relation.
These rules thus define the notion of being at the same generation recursively. Since
common implementations of SQL do not support general recursion such as this example,
such queries cannot be expressed directly in those systems.
The optimization of recursive queries has been an active research area, and has often
focused on some important classes of recursion. We say that a predicate p depends upon a
predicate q (not necessarily distinct from p) if some rule with p in the head has a subgoal
whose predicate either is q or (recursively) depends on q. If p depends upon q and q
depends upon p, p and q are said to be mutually recursive. A program is said to be linear
recursive if each rule contains at most one subgoal whose predicate is mutually recursive
with the head predicate.
Optimization Techniques
While for nonrecursive rules, the optimization problem is similar to that of conventional
relational optimization, the presence of recursive rules opens up a variety of new options
and problems. There is an extensive literature on the subject, and we shall attempt here to
give only the most basic ideas and motivation.
Sometimes, a more restrictive definition is used, requiring that no two distinct predicates
can be mutually recursive, or even that there be at most one recursive rule in the program.
We shall not worry about such distinctions.
a. Magic Sets
The problem addressed by the magic-sets rule rewriting technique is that frequently a
query asks not for the entire relation corresponding to an intensional predicate, but for a
small subset.
A top-down, or backward-chaining search would start from the query as a goal and use
the rules from head to body to create more goals, and none of these goals would be
irrelevant to the query, although some may cause us to explore paths that happen to "dead
end," because data that would lead to a solution to the query happens not to be in the
database. Prolog evaluation is the best known example of top-down evaluation. However,
the Prolog algorithm, like all purely top-down approaches, suffers from some problems. It
is prone to recursive loops, it may perform repeated computation of some subgoals, and it
is often hard to tell that all solutions to the query goal have been found.
On the other hand, a bottom-up or forward-chaining search, working from the bodies of
the rules to the heads, would cause us to infer facts that would never even be considered
in the top-down search. Yet bottom-up evaluation is desirable because it avoids the
problems of looping and repeated computation that are inherent in the top-down
approach. Also, bottom-up approaches allow us to use set-at-a-time operations like
relational joins, which may be made efficient for disk-resident data, while the pure top-
down methods use tuple-at-a-time operations. Magic-sets is a technique that allows us to
rewrite the rules for each query form (i.e., which arguments of the predicate are bound to
constants, and which are variable), so that the advantages of top-down and bottom-up
methods are combined. That is, we get the focus inherent in top-down evaluation
combined with the looping freedom, easy termination testing, and efficient evaluation of
bottom-up evaluation. Magic-sets is a rule-rewriting technique. We shall not give the
method here, of which many variations are known and used in practice; the literature contains an
explanation of the basic techniques, and the following example should suggest the idea.
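The worked example referred to above is not included in these notes, so the following hedged sketch illustrates the idea for the query sg(john, Z)? over the flat/up/down formulation of same-generation: a magic predicate collects the first-argument bindings reachable from john, and the bottom-up pass computes sg only for those bindings. The rewritten rules shown in the comments are the standard binding-passing rewriting; the sample relations are invented for the illustration.

```python
# Hedged illustration of magic-sets rewriting for the query sg(john, Z)?
# Original rules (same-generation):
#     sg(X, Y) <- flat(X, Y)
#     sg(X, Y) <- up(X, U), sg(U, V), down(V, Y)
# Rewritten rules (first argument of sg bound):
#     magic(john).
#     magic(U)  <- magic(X), up(X, U)
#     sg(X, Y)  <- magic(X), flat(X, Y)
#     sg(X, Y)  <- magic(X), up(X, U), sg(U, V), down(V, Y)
# The relations below are invented for the example.

UP   = {("john", "ann"), ("ann", "carol")}
FLAT = {("carol", "dave"), ("ann", "eve"), ("zoe", "bob")}
DOWN = {("dave", "fred"), ("eve", "gail"), ("fred", "hank")}

def magic_set(seed, up):
    """All bindings for the first argument of sg reachable from the query constant."""
    magic = {seed}
    while True:
        new = {u for (x, u) in up if x in magic} - magic
        if not new:
            return magic
        magic |= new

def restricted_sg(magic, up, flat, down):
    """Bottom-up evaluation of sg, but only for first arguments in the magic set."""
    sg = {(x, y) for (x, y) in flat if x in magic}
    while True:
        new = {(x, y)
               for (x, u) in up if x in magic
               for (u2, v) in sg if u2 == u
               for (v2, y) in down if v2 == v} - sg
        if not new:
            return sg
        sg |= new

MAGIC = magic_set("john", UP)                 # {'john', 'ann', 'carol'}
print(sorted(t for t in restricted_sg(MAGIC, UP, FLAT, DOWN) if t[0] == "john"))
```

Note how the tuple starting with "zoe" is never touched: the magic set plays exactly the focusing role of top-down evaluation while the computation itself remains bottom-up and set-at-a-time.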
There are a number of other approaches to optimization that sometimes yield better
performance than magic sets.
These optimizations include the counting algorithm [BMSU86, SZ86, BR87b], the
factoring optimization [NRSU89, KRS90], techniques for deleting redundant rules and
literals [NS89, Sag88], techniques by which "existential" queries (queries for which a
single answer, any answer, suffices) can be optimized [RBK88], and "envelopes" [SS88,
Sag90]. A number of researchers [IW88, ZYT88, Sar89, RSUV89] have studied how to
transform a program that contains nonlinear rules into an equivalent one that contains
only linear rules.
c. Iterative Fixpoint Evaluation
We can improve the efficiency of this algorithm by a simple "trick." If in some round of
the repeated evaluation of the bodies we discover a new fact f, then we must have used,
for at least one of the subgoals in the utilized rule, a fact that was discovered on the
previous round. For if not, then f itself would have been discovered in a previous round.
We may thus reorganize the substitution of facts for the subgoals so that at least one of
the subgoals is replaced by a fact that was discovered in the previous round.
However, we lose an important property of our rules. When rules have the form
introduced in Section 2, there is a unique minimal model of the rules and data. A model
of a program is a set of facts such that for any rule, replacing body literals by facts in the
model results in a head fact that is also in the model. Thus, in the context of a model, a
rule can be understood as saying, essentially, "if the body is true, the head is true".
A minimal model is a model such that no subset is a model. The existence of a unique
minimal model, or least model, is clearly a fundamental and desirable property. Indeed,
this least model is the one computed by naive or seminaive evaluation, as discussed in
Section 3.3. Intuitively, we expect the programmer had in mind the least model when he
or she wrote the logic program. However, in the presence of negated literals, a program
may not have a least model.
The origins of deductive databases can be traced back to work in automated theorem
proving and, later, logic programming. In an interesting survey of the early development
of the field [Min87], Minker suggests that Green and Raphael [GR68] were the first to
recognize the connection between theorem proving and deduction in databases. They
developed a series of question-answering systems that used a version of Robinson's
resolution principle [Rob65], demonstrating that deduction could be carried out
systematically in a database context. (Cordell Green received a Grace Murray Hopper award
from the ACM for his work.)
Other early systems included MRPPS, DEDUCE-2, and DADM. MRPPS was an
interpretive system developed at Maryland by Minker's group from 1970 through 1978
that explored several search procedures, indexing techniques, and semantic query
optimization. One of the first papers on processing recursive queries was [MN82]; it
contained the first description of bounded recursive queries, which are recursive queries
that can be replaced by nonrecursive equivalents. DEDUCE was implemented at IBM in
the mid 1970's [Cha78], and supported left-linear recursive Horn-clause rules using a
compiled approach. DADM [KT81] emphasized the distinction between EDB and IDB
and studied the representation of the IDB in the form of 'connection graphs', closely
related to Sickel's interconnectivity graphs [Sic76], to aid in the development of query
plans.
In 1976, van Emden and Kowalski [vEK76] showed that the least fixpoint of a Horn-
clause logic program coincided with its least Herbrand model. This provided a firm
foundation for the semantics of logic programs, and especially, deductive databases, since
fixpoint computation is the operational semantics associated with deductive databases (at
least, of those implemented using bottom-up evaluation). The early work focused largely
on identifying suitable goals for the field, and on developing a semantic foundation. The
next phase of development saw an increasing emphasis on the development of efficient
query evaluation techniques. Henschen and Naqvi proposed one of the earliest efficient
techniques for evaluating recursive rules.
The area of deductive databases has matured in recent years, and it now seems
appropriate to reflect upon what has been achieved and what the future holds. Here
we provide an overview of the area and briefly describe a number of projects that have led
to implemented systems.
Deductive systems are not the only class of systems with a claim to being an extension of
relational systems.
Prolog's depth-first evaluation strategy leads to infinite loops, even for positive programs
and even in the absence of function symbols or arithmetic. In the presence of large
volumes of data, operational reasoning is not desirable, and a higher premium is placed
upon completeness and termination of the evaluation method.
In a typical database application, the amount of data is sufficiently large that much of it is
on secondary storage. Efficient access to this data is crucial to good performance.
The second problem turns out to be harder. The key to accessing disk data efficiently is to
utilize the set-oriented nature of typical database operations and to tailor both the
clustering of data on disk and the management of buffers in order to minimize the
number of pages fetched from disk. Prolog's tuple-at-a-time evaluation strategy severely
curtails the implementor's ability to minimize disk accesses by re-ordering operations.
The situation can thus be summarized as follows: Prolog systems evaluate logic programs
efficiently in main memory, but are tuple-at-a-time, and therefore inefficient with respect
to disk accesses. In contrast, database systems implement only a nonrecursive subset of
logic programs (essentially described by relational algebra), but do so efficiently with
respect to disk accesses.
The goal of deductive databases is to deal with a superset of relational algebra that
includes support for recursion in a way that permits efficient handling of disk data.
Evaluation strategies should retain Prolog's goal-directed flavor, but be more set-at-a-time.
5.5 Summary
An active database rule has three components: Event, Condition and Action (the ECA model);
if no condition is specified, the condition is treated as always true.
A FOR EACH ROW trigger specifies a row-level trigger.
An active database allows users to make the following changes to triggers:
i. Activate
ii. Deactivate
iii. Drop
Temporal databases deal with time-varying attributes.
Key issues in a distributed database system (DDS) are fragmentation, data allocation and replication.
A Datalog program is a logic program.
5.8 Assignments
1. Discuss in detail the client-server architecture and its advantages.
2. Discuss the advantages and disadvantages of deductive databases.
5.11 Keywords
1. CORBA
2. COM
3. Client-Server Architecture
4. Spatial Queries
5. Fragmentation
6. Allocation
7. Replication
8. Datalog