
KENYA METHODIST UNIVERSITY

DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE

DISTANCE LEARNING MODULE

INSTRUCTION MATERIALS

COURSE CODE: COMP 340

COURSE TITLE: DATABASE MANAGEMENT SYSTEM

DLM Instruction Materials by R.O. Ogollah


CONTENTS

1 Introduction to Database Systems
  1.1 About These Notes
  1.2 Where do we Encounter Databases?
  1.3 The Database Approach
    1.3.1 DBMS vs. File Processing
  1.4 Characteristics of Database Approach
    1.4.1 Self-describing via system catalogs
    1.4.2 Insulation between Programs and Data, and Data abstraction
    1.4.3 Support of multiple Views of Data
    1.4.4 Sharing of Data and Multi-user Transaction Processing
  1.5 User Types
    1.5.1 Database Administrators
    1.5.2 Database Designers
    1.5.3 End Users
  1.6 Advantages of Using a DBMS
    1.6.1 Data Independence
    1.6.2 Controlling Redundancy
    1.6.3 Enforcing Integrity Constraints
    1.6.4 Restricting Unauthorized Access
    1.6.5 Providing Persistent Storage for Program Objects and Data Structures
    1.6.6 Permitting Inferencing and Actions Using Rules
    1.6.7 Providing Multiple User Interfaces
    1.6.8 Representing Complex Relationships Among Data
    1.6.9 Providing Backup and Recovery
  1.7 Drawbacks to DBMS
  1.8 Database Architecture
    1.8.1 Three-schema Architecture
    1.8.2 External View
    1.8.3 Conceptual View
    1.8.4 Internal View
    1.8.5 Mappings
2 Database Analysis
  2.1 Introduction
  2.2 Database Analysis Life Cycle
  2.3 Data Models
  2.4 Entity Relationship Model
    2.4.1 Conceptual Data Models
    2.4.2 The ER Model Basics
    2.4.3 Entities
    2.4.4 Attribute
    2.4.5 Keys
    2.4.6 Relationships
    2.4.7 Degree of a Relationship
    2.4.8 Replacing Ternary Relationships
  2.5 Cardinality
  2.6 Optionality
  2.7 Participation
  2.8 Entity Sets
    2.8.1 Confirming Correctness
    2.8.2 Deriving the Relationship Parameters
  2.9 Redundant Relationships
  2.10 Splitting n:m Relationships
  2.11 Constructing an ER Model
  2.12 ER Examples
  2.13 Problems with ER Models
    2.13.1 Fan Traps
  2.14 Chasm Traps
  2.15 Enhanced ER Models (EER)
    2.15.1 Key Constraints
    2.15.2 Participation Constraints
    2.15.3 Weak Entities
    2.15.4 Aggregation
    2.15.5 Class Hierarchies
    2.15.6 Specialisation
    2.15.7 Generalisation
3 Structured Query Language
  3.1 Database Models
  3.2 Relational Databases
  3.3 Relational Data Structure
  3.4 Domain and Integrity Constraints
  3.5 Structure of a Table
    3.5.1 Columns or Attributes
    3.5.2 Basic Structure
    3.5.3 Characteristics of Relations
  3.6 Primary Keys
  3.7 Integrity Constraints over Relations
    3.7.1 Kinds of Constraints
  3.8 SQL Basics
4 DATABASE DESIGN
  4.1 Schema Refinement
    4.1.1 Problems Caused by Redundancy
    4.1.2 Use of Decomposition
    4.1.3 Informal Design Guidelines for Relational Schemas
  4.2 Functional Dependencies
  4.3 Normalization
  4.4 Integrity Constraints
  4.5 Understanding Data
    4.5.1 First Normal Form
  4.6 Decomposing the Relation
    4.6.1 Second Normal Form
    4.6.2 Third Normal Form
    4.6.3 Normalisation - BCNF
5 Relational Algebra
  5.1 Set Operations - Semantics
6 Concurrency using Transactions
  6.1 Transactions
    6.1.1 Properties of Transactions that a DBMS Must Maintain
    6.1.2 Transaction Schedules
  6.2 Lost Update Scenario
  6.3 Uncommitted Dependency
    6.3.1 Serialisability
    6.3.2 Concurrency Locking
    6.3.3 Deadlock
    6.3.4 Deadlock Handling
    6.3.5 Two-Phase Locking
    6.3.6 Other Database Consistency Methods
  6.4 Crash Recovery
    6.4.1 Why Recovery is Needed
    6.4.2 Typical Strategy for Recovery
    6.4.3 Immediate Update
    6.4.4 Rollback
7 DBMS Implementation
  7.1 Implementing a DBMS
    7.1.1 Disk and Memory
  7.2 Disk Arrangements
    7.2.1 Hash Tables
  7.3 Decision Support
    7.3.1 Data Warehousing
    7.3.2 Data Mining
    7.3.3 Binary Tree
    7.3.4 Index Structure and Access
    7.3.5 Costing Index and File Access
    7.3.6 Use of Indexes
    7.3.7 Shadow Paging
8 DATABASE SECURITY
1 Introduction to Database Systems
Databases and database management systems (DBMS) have become an essential component of
everyday life in modern society. Relational database systems have become increasingly popular since the late 1970s. They offer a powerful method for storing data in an application-independent manner. This means that for many enterprises the database is at the core of the I.T. strategy. Developments can progress around a relatively stable database structure which is secure, reliable, efficient, and transparent.

In early systems, each suite of application programs had its own independent master file. The
duplication of data over master files could lead to inconsistent data. Efforts to use a common
master file for a number of application programs resulted in problems of integrity and security.
The production of new application programs could require amendments to existing application
programs, resulting in `unproductive maintenance'.

Data structuring techniques, developed to exploit random access storage devices, increased the
complexity of the insert, delete and update operations on data. As a first step towards a DBMS,
packages of subroutines were introduced to reduce programmer effort in maintaining these data
structures. However, the use of these packages still requires knowledge of the physical
organization of the data.

1.1 About These Notes


The goal of these notes is to provide an introduction to database management systems, with an
emphasis on how to organize information in a DBMS and to maintain it and retrieve it
efficiently, that is, how to design a database and use a DBMS effectively. The notes will only
serve as a guideline, but the student should read the textbooks provided in the references for more details.
1.2 Where do we Encounter Databases?
Database technology is crucial to the operation and management of modern organizations. We encounter databases, for example:
• In universities, for keeping student information, course registrations, and grades
• In airline reservation and schedule information systems
• In banking, for keeping customer information, accounts, loans, and transactions
• In computerized library catalogues
• In credit card systems, for recording purchases and generating monthly statements
• In telecommunications, for keeping records of calls made and generating monthly bills
• In manufacturing industries, for supply chain management and for tracking the production of items
• In human resources, for keeping information about employees, salaries, payroll taxes, and benefits
• In multimedia systems, where databases store pictures, video clips, and sound messages
• In geographic information systems (GIS), to store and analyze maps, weather data, and satellite images
• In real-time and active database technology, where databases are used to control industrial and manufacturing processes
• In data warehouses and online analytical processing (OLAP) systems, used in many companies to extract and analyze information from large databases
• On the World Wide Web, where database search techniques are applied to improve the search for information needed by users browsing the Internet

To understand the fundamentals of database technology, we must start from the basics of traditional database applications. The next section, therefore, defines what a database is, together with definitions of other related terms.
1.3 The Database Approach

A database system is a computer-based system used to record and maintain information. The information concerned can be anything of significance to the organisation for whose use it is intended.

A database is a logically coherent collection of related data, stored in a consistent form for a specific purpose, such that information can be retrieved in an orderly, related, and meaningful manner.

A database can hold a variety of different things. To make database design more straightforward, the contents of a database are divided into two concepts:

• Schema
• Data

The Schema is the structure of the data, whereas the Data are the "facts". A schema can be complex to understand at first, but it really just states the rules that the Data must obey.
Imagine a case where we want to store facts about employees in a company. Such facts could include their name, address, date of birth, and salary. In a database, all the information on all employees would be held in a single storage "container", called a table. This table is a tabular object like a spreadsheet page, with the different employees as the rows and the facts (e.g. their names) as the columns. Let's call this table EMP; it could look something like Table 1.1:

Table 1.1: An example database table

Name          Home Address       Date of Birth   Salary
Jim Okello    Box 287, Kisumu    1/3/1991        11000
John Kariuki  Box 345, Thika     7/9/1992        13000
Emma Karimi   Box 2789, Meru     3/2/1990        12000

From this information the schema would define that EMP has four components: "NAME", "ADDRESS", "DOB", "SALARY". As designers we can call the columns what we like, but making them meaningful helps. In addition to naming the columns, we want to try to make sure that people don't accidentally store a name in the DOB column, or make some other silly error. Protecting the database against rubbish data is one of the most important database design steps, and will be covered in later chapters. From what we know about the facts, we can say things like:

• NAME is a string, and needs to hold at least 12 characters.


• ADDRESS is a string, and needs to hold at least 20 characters.
• DOB is a date... The company forbids people over 100 years old or younger than 18 years
old working for them.
• SALARY is a number. It must be greater than zero.

Such rules, also referred to as constraints, can be enforced by the database. During the design phase of a database schema, these and more complex rules are identified and, where possible, implemented. The more rules there are, the harder it is to enter poor-quality data.

A Database Management System (DBMS) is a collection of programs that enables users to create and maintain a database. It can also be defined as a general-purpose software system that facilitates the process of defining, constructing, and manipulating databases for various applications. Defining means specifying the data types, structures, and constraints, as described in the example above. Constructing is the process of storing the data itself on some storage medium controlled by the DBMS. Manipulating involves operations such as querying to retrieve information, updating the database, and generating reports.

An Example
Let us consider another example: a UNIVERSITY database for maintaining information concerning students, courses, and grades in a university environment. Figure 1.1 shows the database structure and a few relationships in the database.

[Figure 1.1: A simple University database. The database holds the entities students, faculty, courses, offerings, and enrollments, and relationships such as "faculty teach offerings", "students enroll in offerings", and "offerings are made of courses". It is shared by applications such as registration, faculty assignment, grade recording, and course scheduling.]


To define this database, we must specify the structure of the records of each file by specifying the different types of data elements to be stored in each record. For example, the STUDENT record may contain information representing the student's name, registration number, class, courses, and major. Each COURSE record includes data representing the course name, course number, credit hours, department, etc.
Notice that the records in the various files may be related.

1.3.1 DBMS vs. File Processing


The typical file-processing system stores permanent records in various files and needs different application programs to extract records from, and add records to, the appropriate files.
Disadvantages of file-processing systems:
• Data redundancy and inconsistency: redundancy may lead to higher storage and access costs, duplication of effort, and data inconsistency.
• Difficulty in accessing data.
• Data isolation: data are scattered in various files, which may be in different formats.
• Integrity problems: data values may not satisfy certain consistency constraints.
• Atomicity problems: atomicity means that an operation on the database must happen in its entirety or not at all.
• Concurrent-access anomalies.
• Security problems: not every user of the database system should be able to access all the data, e.g. confidential financial data. This is difficult to enforce in a file-processing system.

1.4 Characteristics of Database Approach


A number of characteristics distinguish the database approach from the traditional approach of
programming with files. In traditional file processing, each user defines and implements the files
needed for a specific application as part of programming the application. For example, in the
University database, one user, the grade-reporting officer, may keep a file on students and their
grades. Programs to print a student's transcripts and to enter new grades into files are
implemented. A second user, the accountant, may keep track of students' fees and their payments. Although both users are interested in data about students, each user maintains different files and programs to manipulate these files. This redundancy in defining and storing data results in wasted storage space and in redundant efforts to maintain common data.
In the database approach, a single repository of data is maintained; it is defined once and then accessed by the various users.
1.4.1 Self-describing via system catalogs
A fundamental characteristic of the database approach is that the database system contains not only the database itself but also a complete definition of the database structure and its constraints. This information is stored in the system catalog and is called metadata. The catalog is used by the DBMS software and also by database users who need information about the database structure.
Whereas file-processing software can access only specific databases, DBMS software can work with diverse databases by extracting their definitions from the catalog and then using these definitions.
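For example, many SQL DBMSs expose the catalog through the standard INFORMATION_SCHEMA views (the exact views available, and the case in which names are stored, vary by product), so the definition of a table such as the EMP table above can itself be queried like ordinary data. A small sketch:

    -- List the columns and data types the catalog records for the EMP table
    SELECT column_name, data_type, is_nullable
    FROM   information_schema.columns
    WHERE  table_name = 'EMP';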

1.4.2 Insulation between Programs and Data, and Data abstraction
In traditional file processing, the structure of data files is embedded in the access programs, so
any changes to the structure of a file may require changing all programs that access this file. By
contrast, DBMS access programs do not require such changes in most cases. The structure of data files is stored in the DBMS catalog separately from the access programs. This property is called program-data independence.
A DBMS provides users with a conceptual representation of data that does not include many of
the details of how the data is stored or how the operations are implemented. This is referred to as
data abstraction.
1.4.3 Support of multiple Views of Data
A database typically has many users, each of whom may require a different perspective or view of the database. A view may be a subset of the database, or it may contain virtual data that is derived from the database files but is not explicitly stored.
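A hedged SQL sketch, reusing the EMP table from Section 1.3: a payroll clerk could be given a view that exposes names and salaries but hides addresses and dates of birth. The view's rows are not stored separately; they are derived from EMP whenever the view is queried.

    -- A restricted, derived view over the EMP table
    CREATE VIEW EMP_PAYROLL AS
    SELECT NAME, SALARY
    FROM   EMP;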
1.4.4 Sharing of Data and Multi-user Transaction Processing
A multi-user DBMS must allow multiple users to access the database at the same time. The
DBMS must include concurrency control software to ensure that several users trying to update
the same data do so in a controlled manner so that the result of the updates is correct.
1.5 User Types
When considering users of a Database system, there are three broad classes to consider:

• The application programmer, responsible for writing programs in some high-level language such as C++, etc.
• The end-user, who accesses the database via a query language.
• The database administrator (DBA), who controls all operations on the database.

For a small personal database, one person typically defines, constructs, and manipulates the
database. However, for large databases, many people are involved. We next describe the three
broad categories mentioned above.
1.5.1 Database Administrators
The DBA is responsible for authorizing access to the database, for coordinating and monitoring
its use, and for acquiring software and hardware resources as needed. The DBA is accountable
for problems such as breach of security or poor system response time.

In particular, the DBA's responsibilities include the following:

• Deciding the information content of the database, i.e. identifying the entities of interest to
the enterprise and the information to be recorded about those entities.
• Deciding the storage structure and access strategy, i.e. how the data is to be represented
by writing the storage structure definition.
• Liaising with users, i.e. to ensure that the data they require is available and to write the
necessary external schemas and conceptual/external mapping.

• Defining authorisation checks and validation procedures. Authorisation checks and validation procedures are extensions to the conceptual schema and can be specified using the DDL.
• Defining a strategy for backup and recovery, for example periodic dumping of the database to a backup tape, and procedures for reloading the database from the backup. A log file, in which each log record contains the values of database items before and after a change, can also be used for recovery purposes.
• Monitoring performance and responding to changes in requirements, i.e. changing details of storage and access, thereby organising the system so as to get the performance that is best for the enterprise.

1.5.2 Database Designers


Database designers are responsible for identifying the data to be stored in the database and for
choosing appropriate structures to represent and store this data. These tasks are mostly
undertaken before the database is actually implemented and populated with data. It is the
responsibility of the database designer to communicate with all prospective users, in order to
understand their requirements, and to come up with a design that meets these requirements.
1.5.3 End Users
There are several categories of end users:
• Casual end users occasionally access the database, but they may need different information each time. They use a sophisticated database query language to specify their requests, and are typically middle- or high-level managers or other occasional browsers.
• Naïve or parametric end users make up a sizable portion of database end users. Their main job function revolves around constantly querying and updating the database, using standard types of queries and updates - called canned transactions - that have been carefully programmed and tested.
• Sophisticated end users include engineers, scientists, business analysts, and others who thoroughly familiarize themselves with the facilities of the DBMS so as to implement their applications and meet their complex requirements.
• Stand-alone end users maintain personal databases by using ready-made program packages that provide easy-to-use menu-based or graphics-based interfaces.
1.6 Advantages of Using a DBMS
The facilities offered by DBMSs vary a great deal, depending on their level of sophistication. In general, however, a good DBMS should provide the following advantages over a conventional system:

• Independence of data and program - This is a prime advantage of a database. Both the
database and the user program can be altered independently of each other thus saving
time and money, which would be required to retain consistency.
• Data shareability and non-redundancy of data - The ideal situation is to enable applications to share an integrated database containing all the data needed by the applications and thus eliminate, as far as possible, the need to store data redundantly.
• Integrity - With many different users sharing various portions of the database, it is impossible for each user to be responsible for the consistency of the values in the database and for maintaining the relationships of their data items to all other data items, some of which may be unknown or even prohibited for the user to access.
• Centralised control - With central control of the database, the DBA can ensure that
standards are followed in the representation of data.
• Security - Having control over the database the DBA can ensure that access to the
database is through proper channels and can define the access rights of any user to any
data items or defined subset of the database. The security system must prevent corruption
of the existing data either accidentally or maliciously.
• Performance and Efficiency - In view of the size of databases and of demanding database
accessing requirements, good performance and efficiency are major requirements.
Knowing the overall requirements of the organisation, as opposed to the requirements of
any individual user, the DBA can structure the database system to provide an overall
service that is `best for the enterprise'.

1.6.1 Data Independence

• This is a prime advantage of a database. Both the database and the user program can be
altered independently of each other.
• In a conventional system applications are data dependent. This means that the way in
which the data is organised in secondary storage and the way in which it is accessed are
both dictated by the requirements of the application, and, moreover, that knowledge of
the data organisation and access technique is built into the application logic.
• For example, if a file is stored in indexed sequential form then an application must know
o that the index exists
o the file sequence (as defined by the index)

The internal structure of the application will be built around this knowledge. If, for example, the file were to be replaced by a hash-addressed file, major modifications would have to be made to the application.

1.6.2 Controlling Redundancy


In traditional software development utilizing file processing, every user group maintains its own files for handling its data processing applications. This redundancy in storing the same data multiple times leads to several problems, such as:
• A single logical update - such as entering the data for a new record - has to be performed multiple times, once for each file, leading to duplication of effort.
• Storage space is wasted when the same data is stored repeatedly.
• Files that represent the same data may become inconsistent. This may happen because an update is applied to some of the files but not to others.

Data integration is generally regarded as an important characteristic of a database. The avoidance of redundancy should be an aim; however, the vigour with which this aim should be pursued is open to question.

Redundancy is:

• Direct if a value is a copy of another.
• Indirect if the value can be derived from other values:
 o this simplifies retrieval but complicates update;
 o conversely, integration makes retrieval slower and updates easier.
• Data redundancy can lead to inconsistency in the database unless it is controlled. The system should be aware of any data duplication - the system is responsible for ensuring that updates are carried out correctly. A database with uncontrolled redundancy can be in an inconsistent state - it can supply incorrect or conflicting information. A given fact represented by a single entry cannot result in inconsistency; however, few systems are capable of propagating updates, i.e. most systems do not support controlled redundancy.

1.6.3 Enforcing Integrity Constraints


Most database applications have certain integrity constraints that must hold for the data. A
DBMS should provide capabilities for defining and enforcing these constraints.
This describes the problem of ensuring that the data in the database is accurate.

• Inconsistencies between two entries representing the same `fact' give an example of lack
of integrity (caused by redundancy in the database).
• Integrity constraints can be viewed as a set of assertions to be obeyed when updating a
DB to preserve an error-free state.
• Even if redundancy is eliminated, the DB may still contain incorrect data.
• Integrity checks, which are important, are checks on data items and record types.

Integrity checks on data items can be divided into 4 groups:

• Type checks
o e.g. ensuring a numeric field is numeric and not a character - this check should be
performed automatically by the DBMS.
• Redundancy checks
o direct or indirect - this check is not automatic in most cases.
• Range checks
o e.g. to ensure a data item value falls within a specified range of values, such as
checking dates so that say (age > 0 AND age < 110).
• Comparison checks

o in this check a function of a set of data item values is compared against a function
of another set of data item values. For example, the max salary for a given set of
employees must be less than the min salary for the set of employees on a higher
salary scale.
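As a sketch of the last two kinds of check in SQL, assuming a hypothetical EMPLOYEE table with AGE, SALARY, and SCALE (salary scale) columns: a range check can usually be declared directly on the table, while a comparison check across sets of rows is shown here only as a query that reports violations, since most products need triggers or assertions to enforce such cross-row rules automatically.

    -- Range check declared on the table
    ALTER TABLE EMPLOYEE
        ADD CONSTRAINT chk_age CHECK (AGE > 0 AND AGE < 110);

    -- Comparison check: list scales where the maximum salary is not below
    -- the minimum salary of the next scale up
    SELECT lo.SCALE
    FROM   EMPLOYEE lo
    JOIN   EMPLOYEE hi ON hi.SCALE = lo.SCALE + 1
    GROUP  BY lo.SCALE
    HAVING MAX(lo.SALARY) >= MIN(hi.SALARY);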

A record type may have constraints on the total number of occurrences, or on the insertion and deletion of records. For example, in a patient database there may be a limit on the number of X-ray results held for each patient, or the details of a patient's visit to hospital may have to be kept for a minimum of 5 years before they can be deleted.

• Centralized control of the database helps maintain integrity, and permits the DBA to
define validation procedures to be carried out whenever any update operation is
attempted (update covers modification, creation and deletion).
• Integrity is important in a database system - an application run without validation
procedures can produce erroneous data which can then affect other applications using that
data.

1.6.4 Restricting unauthorized access


When multiple users share a database, it is likely that some users will not be authorized to access
all information in the database. For example, financial data is considered confidential, and hence
only authorized persons are allowed to access such data. A DBMS should provide a security and authorization subsystem, which the DBA uses to create accounts and to specify their restrictions. The DBMS should then enforce these restrictions automatically.
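In SQL this is typically expressed with GRANT and REVOKE statements. A minimal sketch, assuming the EMP table and two hypothetical accounts, payroll_clerk and intern (column-level privileges as shown are supported by many, but not all, products):

    -- Allow the payroll clerk to read EMP and to update only the SALARY column
    GRANT SELECT, UPDATE (SALARY) ON EMP TO payroll_clerk;

    -- Remove whatever access the intern account previously had
    REVOKE ALL PRIVILEGES ON EMP FROM intern;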

1.6.5 Providing Persistent Storage for Program Objects and Data Structures
This is one of the main reasons for the emergence of the object-oriented database.
Programming languages typically have complex data structures such as the class definition in
C++. The values of program variables are discarded once a program terminates, unless the programmer explicitly stores them in permanent files, which often involves converting these complex structures into a format suitable for file storage. When the need arises to read this data once more, the programmer must convert from the file format back into the program variable structure. Object-oriented database systems are compatible with programming languages such as C++ and Java, and the DBMS software automatically performs any necessary conversions. Thus object-oriented database systems typically offer data structure compatibility with one or more object-oriented programming languages.
1.6.6 Permitting Inferencing and Actions Using Rules
Some database systems provide capabilities for defining deductive rules for inferring new information from the stored database facts. Such systems are called deductive database systems.
1.6.7 Providing Multiple User Interfaces
Because many types of users with varying levels of technical knowledge use a database, a
DBMS should provide a variety of user interfaces. These include query languages for casual
users, programming language interface for application programmers, forms and command codes
for parametric users, and menu driven interfaces and natural language interfaces for stand-alone
users.
1.6.8 Representing Complex Relationships Among Data
A database may include numerous varieties of data that are interrelated in many ways.
1.6.9 Providing Backup And Recovery
A DBMS must provide facilities for recovering from hardware or software failures. For example, if the computer system fails in the middle of a complex update program, the recovery subsystem is responsible for making sure that the database is restored to the state it was in before the program started executing.

1.7 Drawbacks to DBMS


In spite of the advantages of using a DBMS, there are a few situations in which such a system may involve unnecessary overhead costs that would not be incurred in traditional file processing.
The overhead costs of using a DBMS are due to the following:
• High initial investment in hardware, software, and training.
• Generality that DBMS provides for defining and processing data.
• Overhead for providing security, concurrency control, recovery, and integrity functions

Additional problems may arise if the database designers and DBA do not properly design the
database or if the database application is not implemented properly. Hence, it may be more
desirable to use regular files under the following circumstances:
• The database and application are simple, well defined, and not expected to change.
• There are stringent real-time requirements for some programs that may not be met
because of DBMS overhead.
• Multiple-user access to data is not required
The limitations of a DBMS may be summarized as follows:
• Expensive
• Complex - administration is a full-time job
• Abstraction is not free
• Overhead of query processing
• Process optimization
• Features may not be needed:
 - Security - single user
 - Concurrency - single user
 - Integrity - single application
 - Recovery - not mission critical
• Nonprocedural access
1.8 Database Architecture
1.8.1 Three-schema Architecture
DBMSs do not all conform to the same architecture. The three-level architecture forms the basis
of modern database architectures. The architecture for DBMSs is divided into three general
levels:

• External
• Conceptual
• Internal

These are illustrated in Figure 1.2.

Figure 1.2: Three level architecture
• The external level: concerned with the way individual users see the data. It expresses the properties of program/data independence and of multiple 'views' of the database.
• The conceptual level: can be regarded as a community user view - a formal description of the data of interest to the organisation, independent of any storage considerations.
• The internal level: concerned with the way in which the data is actually stored.

1.8.2 External View


A user is anyone who needs to access some portion of the data. Users may range from application programmers to casual users with ad hoc queries. Each user has a language at his or her disposal.

The application programmer may use a high-level language (e.g. C++) while the casual user will probably use a query language. Whatever the language used, it will include a data sub-language (DSL): that subset of the language concerned with the storage and retrieval of information in the database, which may or may not be apparent to the user.
A DSL is a combination of two languages:
• a data definition language (DDL) - provides for the definition or description of database
objects
• a data manipulation language (DML) - supports the manipulation or processing of
database objects.
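For instance, in SQL (introduced in Chapter 3) the two sub-languages appear as distinct kinds of statement. A small sketch, reusing the EMP table defined earlier (date literal syntax varies slightly between products):

    -- DDL: defines or changes database objects
    CREATE INDEX emp_name_idx ON EMP (NAME);

    -- DML: retrieves and changes the data held in those objects
    INSERT INTO EMP (NAME, ADDRESS, DOB, SALARY)
    VALUES ('Jim Okello', 'Box 287, Kisumu', DATE '1991-03-01', 11000);

    SELECT NAME, SALARY FROM EMP WHERE SALARY > 11500;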
Each user sees the data in terms of an external view, defined by an external schema. The external schema consists basically of descriptions of each of the various types of external record in that external view, together with a definition of the mapping between the external schema and the underlying conceptual schema.

Figure 1.3: How the three level architecture works

1.8.3 Conceptual View


• An abstract representation of the entire information content of the database.
• It is in general a view of the data as it actually is, that is, it is a `model' of the `real world'.
• It consists of multiple occurrences of multiple types of conceptual record, defined in the
conceptual schema.
• To achieve data independence, the definitions of conceptual records must involve
information content only.
• Storage structure is ignored
• Access strategy is ignored
• In addition to definitions, the conceptual schema contains authorisation and validation
procedures.

1.8.4 Internal View


The internal view is a low-level representation of the entire database consisting of multiple
occurrences of multiple types of internal (stored) records.

It is, however, at one remove from the physical level, since it does not deal in terms of physical records or blocks, nor with any device-specific constraints such as cylinder or track sizes. Details of the mapping to physical storage are highly implementation-specific and are not expressed in the three-level architecture.

The internal view is described by the internal schema, which defines:

• the various types of stored record
• what indices exist
• how stored fields are represented
• what physical sequence the stored records are in

In effect the internal schema is the storage structure definition.

1.8.5 Mappings
• The conceptual/internal mapping:
o defines conceptual and internal view correspondence
o specifies mapping from conceptual records to their stored counterparts
• An external/conceptual mapping:
o defines a particular external and conceptual view correspondence
• A change to the storage structure definition means that the conceptual/internal mapping
must be changed accordingly, so that the conceptual schema may remain invariant,
achieving physical data independence.
• A change to the conceptual definition means that the conceptual/external mapping must
be changed accordingly, so that the external schema may remain invariant, achieving
logical data independence.

2 Database Analysis
This unit is concerned with the process of taking a database specification from a customer and implementing the underlying database structure necessary to support that specification.

2.1 Introduction
Data analysis is concerned with the NATURE and USE of data. It involves the identification of
the data elements which are needed to support the data processing system of the organization, the
placing of these elements into logical groups and the definition of the relationships between the
resulting groups.
Other approaches, e.g. DFDs and flowcharts, have been concerned with the flow of data - these are dataflow methodologies. Data analysis is one of several data-structure-based methodologies.

Systems analysts often, in practice, go directly from fact finding to implementation-dependent data analysis. Their assumptions about the usage of, properties of, and relationships between data elements are embodied directly in record and file designs and computer procedure specifications.
The introduction of Database Management Systems (DBMS) has encouraged a higher level of analysis, where the data elements are defined by a logical model or `schema' (conceptual schema). When discussing the schema in the context of a DBMS, the effects of alternative designs on the efficiency or ease of implementation are considered, i.e. the analysis is still somewhat implementation dependent.

It is fair to ask why data analysis should be done if it is possible, in practice, to go straight to a computerised system design. Data analysis is time consuming; it throws up a lot of questions. Implementation may be slowed down while the answers are sought.

From another viewpoint, data analysis provides useful insights into general design principles which will benefit the trainee analyst even if he or she finally settles for a `quick and dirty' solution.

The development of techniques of data analysis has helped us to understand the structure and meaning of data in organisations. Data analysis techniques can be used as the first step of extrapolating the complexities of the real world into a model that can be held on a computer and
be accessed by many users. The data can be gathered by conventional methods such as
interviewing people in the organisation and studying documents. The facts can be represented as
objects of interest. There are a number of documentation tools available for data analysis, such as
entity-relationship diagrams. These are useful aids to communication, help to ensure that the
work is carried out in a thorough manner, and ease the mapping processes that follow data
analysis. Some of the documents can be used as source documents for the data dictionary.

2.2 Database Analysis Life Cycle


When a database designer is approaching the problem of constructing a database system, the logical steps followed are those of the database analysis life cycle:

• Database study - here the designer creates a written specification in words for the
database system to be built. This involves:

o Analyzing the company situation - is it an expanding company, dynamic in its
requirements, mature in nature, solid background in employee training for new
internal products, etc.
o Define problems and constraints - what is the situation currently? How does the
company deal with the task which the new database is to perform? Any issues
around the current method? What are the limits of the new system?
o Define objectives - what is the new database system going to have to do, and in what way must it be done? What information, specifically, does the company want to store, and what does it want to calculate? How will the data evolve?
o Define scope and boundaries - what is stored on this new database system, and what is stored elsewhere? Will it interface to another database?

Figure 2.1: Database Analysis Life Cycle

• Database Design - conceptual, logical, and physical design steps in taking specifications
to physical implementable designs.
• Implementation and loading - it is quite possible that the database is to run on a machine which does not yet have a database management system running on it. If this is the case, one must be installed on that machine. Once a DBMS has been installed, the database itself must be created within the DBMS. Finally, not all databases start completely empty, so the database must be loaded with the initial data set (such as the current inventory, current staff names, current customer details, etc.).
• Testing and evaluation - the database, once implemented, must be tested against the specification supplied by the client. It is also useful to test the database with the client using mock data, as clients do not always have a full understanding of what they think they have specified and how it differs from what they have actually asked for! In addition, this step in the life cycle offers the designer the chance to fine-tune the system for best performance. Finally, it is a good idea to evaluate the database in situ, along with any linked applications.
• Operation - this step is where the system is actually in real usage by the company.
• Maintenance and evolution - designers rarely get everything perfect the first time, and it may be the case that the company requests changes to fix problems with the system, or recommends enhancements or new requirements.
 o Commonly, development takes place without change to the database structure. In elderly systems the database structure tends to become fossilised.

2.3 Data Models


A data model can be defined as a set of constructs for defining and manipulating a database. Examples of data models, with some of their constructs, include the following:

• Entity-Relationship – entity, attribute, aggregate


• Semantic – class (abstract, concrete, aggregate), subclass, property
• Functional – entity, functional relationship
• Logic (deductive) – fact, rule
• Object – class, inheritance, attribute, object, method
• Relational – relation, tuple, attribute
A schema can be defined as a particular description of a database using a particular data model.

2.4 Entity Relationship Model

Entity Relationship (ER) modelling

• is a design tool
• is a graphical representation of the database system
• provides a high-level conceptual data model
• supports the user's perception of the data
• is DBMS and hardware independent
• has many variants
• is composed of entities, attributes, and relationships

2.4.1 Conceptual Data Models

• The ER Model is used at this stage.


• A conceptual schema is a concise description of the data requirements of the users.
• It includes detailed descriptions of the entity types, relationships, and constraints.
• High-level data model
• DBMS-independent data models.
• Goal is to develop a conceptual schema that represents a portion of the real world.
• More general data model than DBMS-specific data models.

• Defers DBMS implementation decisions.

2.4.2 The ER Model Basics


• Entities represent real-world "things".
• Relationships describe how entities are related and interact.
• The ER model is commonly used and supported by many CASE tools.
• It is easily mapped to commonly-used implementation data models (especially the relational model).
• It presents data definition only, no manipulation.
• Any high-level design or schema is subjective.

2.4.3 Entities
• An entity is any object in the system that we want to model and store information about.
Entities represent real-world objects (person, employee, student) and concepts
(department, company, course).
• Individual objects are called entities.
• Groups of the same type of objects are called entity types or entity sets
• Entities are represented by rectangles (either with round or square corners)
• There are two types of entities: weak and strong entity types.

An entity is described by a set of attributes and values (e.g., person.name = "Joe")

2.4.4 Attribute
• All the data relating to an entity is held in its attributes.
• An attribute is a property of an entity.
• Each attribute can have any value from its domain.
• Each entity within an entity type:
 o may have any number of attributes,
 o can have different attribute values from those of any other entity,
 o has the same number of attributes as the other entities of that type.
• Attributes can be :
o simple or composite
o single-valued or multi-valued
• Attributes can be shown on ER models
• They appear inside ovals and are attached to their entity.
• Note that entity types can have a large number of attributes. If all were shown, the diagrams would be confusing. Only show an attribute if it adds information to the ER diagram, or clarifies a point.

More Complex Attributes


A number of atomic attributes can be thought of as a single, composite attribute (e.g., address ->
street, city, state, zip).
Single-valued vs. multi-valued, e.g. phoneNumbers; composite multi-valued attributes are probably better modelled as "weak" entities (e.g., education -> school, degree).
Stored vs. derived, e.g. age derived from the date of birth, or a bank balance derived from deposits and withdrawals.
2.4.5 Keys
• A key is a data item that allows us to uniquely identify individual occurrences of an entity type.
• A candidate key is an attribute or set of attributes that uniquely identifies individual occurrences of an entity type.
• An entity type may have one or more possible candidate keys, the one which is selected
is known as the primary key.
• A composite key is a candidate key that consists of two or more attributes
• The name of each primary key attribute is underlined.

• Each Entity has a Unique Key


• "Super key" is any set of attributes that uniquely identify an entity type.
• Maximal super key is typically composed of all attributes.
• Minimal super key is a super key with only essential attributes.
• Candidate keys are the set of all minimal super keys.
• Primary key is a minimal super key chosen to identify the entity.

• "Weak" entities Must be related to at least one strong entity ("identifying relationship”,
parent/child existence)
• Unique key is combination of entity key and "identifying relationship”
• Nulls: Not applicable, unknown value but known to exist
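A hedged SQL sketch of these ideas, using a hypothetical STUDENT entity: the registration number and the national ID number are both candidate keys; the registration number is chosen as the primary key, and the remaining candidate key is declared UNIQUE. The ENROLMENT table shows a composite key.

    CREATE TABLE STUDENT (
        REG_NO      VARCHAR(12),            -- candidate key, chosen as the primary key
        NATIONAL_ID VARCHAR(10) NOT NULL,   -- another candidate key
        NAME        VARCHAR(40) NOT NULL,
        PRIMARY KEY (REG_NO),
        UNIQUE (NATIONAL_ID)
    );

    CREATE TABLE ENROLMENT (
        REG_NO      VARCHAR(12) REFERENCES STUDENT,
        COURSE_CODE VARCHAR(8),
        PRIMARY KEY (REG_NO, COURSE_CODE)   -- a composite key
    );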

2.4.6 Relationships
• A relationship type is a meaningful association between entity types
• A relationship is an association of entities where the association includes one entity from
each participating entity type.
• Relationship types are represented on the ER diagram by a series of lines.
• In the original Chen notation, the relationship is placed inside a diamond, e.g. managers
manage employees:

• For this module, we will use an alternative notation, where the relationship is a label on
the line. The meaning is identical

• Relationships define associations between entities (e.g., Student Enrolled_In Course).


• An association between the instances of two or more entity types.

2.4.7 Degree of a Relationship
• The number of participating entities in a relationship is known as the degree of the
relationship.
• If there are two entity types involved it is a binary relationship type, e.g. student enrolled_in course, instructor teaches course, manager manages employee, etc.

• If there are three entity types involved it is a ternary relationship type, e.g. student enrolled_in course offered_by school, or instructor teaches course at school. Another example of a ternary relationship is given in the diagram below.

Note: Don't create unnecessary ternary relationships that can be represented as binary.

• It is possible to have an n-ary relationship (e.g. quaternary or unary).

• Unary relationships are also known as recursive relationships, i.e. an entity is related to itself.

• It is a relationship where the same entity participates more than once in different roles.
• In the example above we are saying that employees are managed by employees.
• If we wanted more information about who manages whom, we could introduce a second
entity type called manager.

Questions
• What are the potential problems with recursive relationships?
• Should relationship attributes be a separate entity?
• Which entity ultimately stores that attribute?

• It is also possible to have entities associated through two or more distinct relationships.
This is an example of multiple relationships.

• In the representation we use, it is not possible to have attributes as part of a relationship. To support this, other entity types need to be developed.

• Relationships can be constrained.

2.4.8 Replacing ternary relationships


When ternary relationships occur in an ER model they should always be removed before finishing the model. Sometimes the relationship can be replaced by a series of binary relationships that link pairs of entities from the original ternary relationship.

• This can result in the loss of some information - it is no longer clear which sales assistant sold a customer a particular product.
• Try replacing the ternary relationship with an entity type and a set of binary relationships.

Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as
a noun.

• The relationship sells can become the entity type sale.

• So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product. This process also works for higher order relationships.
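A sketch of the resulting structure in relational terms (table and column names are hypothetical, and the SALES_ASSISTANT, CUSTOMER, and PRODUCT tables are assumed to exist already): the new sale entity carries one identifier from each of the three original entities, so every sale records exactly which assistant sold which product to which customer.

    CREATE TABLE SALE (
        SALE_ID      INTEGER PRIMARY KEY,
        ASSISTANT_ID INTEGER NOT NULL REFERENCES SALES_ASSISTANT,
        CUSTOMER_ID  INTEGER NOT NULL REFERENCES CUSTOMER,
        PRODUCT_ID   INTEGER NOT NULL REFERENCES PRODUCT,
        SALE_DATE    DATE
    );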

2.5 Cardinality
• Relationships are rarely one-to-one
• For example, a manager usually manages more than one employee
• This is described by the cardinality of the relationship, for which there are four possible
categories.
o One to one (1:1) relationship
o One to many (1:m) relationship

o Many to one (m:1) relationship
o Many to many (m:n) relationship
• On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's
foot" end represents many.
• A one to one relationship - a man can only marry one woman, and a woman can only
marry one man, so it is a one to one (1:1) relationship

• A one to many relationship - one manager manages many employees, but each employee
only has one manager, so it is a one to many (1:m) relationship

• A many to one relationship - many students study one course. They do not study more
than one course, so it is a many to one (m:1) relationship

• A many to many relationship - One lecturer teaches many students and a student is taught
by many lecturers, so it is a many to many (m:n) relationship

2.6 Optionality
A relationship can be optional or mandatory.

• If the relationship is mandatory an entity at one end of the relationship must be related to
an entity at the other end.
• The optionality can be different at each end of the relationship. For example, a student
must be on a course. This is mandatory, so the relationship `student studies course' is
mandatory.
• But a course can exist before any students have enrolled. Thus the relationship `course
is_studied_by student' is optional.
• To show optionality, put a circle or `O' at the `optional end' of the relationship.
• As the optional relationship is `course is_studied_by student', and the optional part of this
is the student, then the `O' goes at the student end of the relationship connection.

• It is important to know the optionality because you must ensure that whenever you create
a new entity it has the required mandatory links

2.7 Participation
Participation may be partial or total ("existence dependency").
Consider student, faculty, course, and department - what are the relationships between them?

Relationships and Weak Entities


• "Weak" entities must participate in a total relationship to an "owner"
• Weak entities can't stand on their own - they have no super key.
• A weak entity always has at least one total participation relationship with a strong entity
(the "owner"). E.g. student(name, id) hasTaken transcript(dept, courseNumber, year,
grade)
• However, total participation does not imply a weak entity.
• faculty(name, id) teaches course(dept, courseNumber, courseName)
• Primary key for weak entity is some or all attributes of the weak entity + the primary key
from "owner" entity.

2.8 Entity Sets


Sometimes it is useful to try out various examples of entities from an ER model. One reason for
this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set
diagram' to show entity examples graphically. Consider the example of `course is_studied_by
student'.

2.8.1 Confirming Correctness

• Use the diagram to show all possible relationship scenarios.
• Go back to the requirements specification and check to see if they are allowed.
• If not, then put a cross through the forbidden relationships
• This allows you to show the cardinality and optionality of the relationship

2.8.2 Deriving the relationship parameters


To check we have the correct parameters (sometimes also known as the degree) of a relationship,
ask two questions:

• One course is studied by how many students? Answer = `zero or more'.

o This gives us the degree at the `student' end.


o The answer `zero or more' needs to be split into two parts.
o The `more' part means that the cardinality is `many'.
o The `zero' part means that the relationship is `optional'.
o If the answer was `one or more', then the relationship would be `mandatory'.

• One student studies how many courses? Answer = `One'

o This gives us the degree at the `course' end of the relationship.


o The answer `one' means that the cardinality of this relationship is 1, and is
`mandatory'
o If the answer had been `zero or one', then the cardinality of the relationship would
have been 1, and be `optional'.

2.9 Redundant relationships

Some ER diagrams end up with a relationship loop.

• check to see if it is possible to break the loop without losing information


• Given three entities A, B, C, where there are relations A-B, B-C, and C-A, check if it is
possible to navigate between A and C via B. If it is possible, then A-C was a redundant
relationship.
• Always check carefully for ways to simplify your ER diagram. It makes it easier to read
the remaining information.

Redundant relationships example

• Consider entities `customer' (customer details), `address' (the address of a customer) and
`distance' (distance from the company to the customer address).

2.10 Splitting n:m Relationships

A many to many relationship in an ER model is not necessarily incorrect. It can be replaced
using an intermediate entity. This should only be done where:

• the m:n relationship hides an entity


• the resulting ER diagram is easier to understand.

Splitting n:m Relationships - Example

Consider the case of a car hire company. Customers hire cars; one customer hires many cars and
a car is hired by many customers.

The many to many relationship can be broken down to reveal a `hire' entity, which contains an
attribute `date of hire'.
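
As a rough SQL sketch (the table and column names and types below are illustrative assumptions),
the intermediate `hire' entity becomes a table that holds a foreign key to each side of the
original m:n relationship, plus the `date of hire' attribute:

CREATE TABLE customer (
    customer_no INTEGER PRIMARY KEY,
    name        VARCHAR(50)
);

CREATE TABLE hire_car (
    regno VARCHAR(10) PRIMARY KEY,
    make  VARCHAR(20)
);

CREATE TABLE hire (
    customer_no  INTEGER     NOT NULL,
    regno        VARCHAR(10) NOT NULL,
    date_of_hire DATE        NOT NULL,
    PRIMARY KEY (customer_no, regno, date_of_hire),
    FOREIGN KEY (customer_no) REFERENCES customer(customer_no),
    FOREIGN KEY (regno)       REFERENCES hire_car(regno)
);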

2.11 Constructing an ER model

Before beginning to draw the ER model, read the requirements specification carefully. Document
any assumptions you need to make.

• Identify entities - list all potential entity types. These are the objects of interest in the
system. It is better to put too many entities in at this stage and then discard them later if
necessary.

• Remove duplicate entities - ensure that they really are separate entity types and not just two
names for the same thing.

o Also do not include the system as an entity type


o e.g. if modelling a library, the entity types might be books, borrowers, etc.
o The library is the system, thus should not be an entity type.

• List the attributes of each entity (all properties to describe the entity which are relevant to
the application).

o Ensure that the entity types are really needed.


o are any of them just attributes of another entity type?
o if so keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!

• Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
• Define the relationships
o Examine each entity type to see its relationship to the others.
• Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
• Remove redundant relationships

o Examine the ER model for redundant relationships.

Figure: ER Notation

ER modelling is an iterative process, so draw several versions, refining each one until you are
happy with it. Note that there is no one right answer to the problem, but some solutions are better
than others!

Naming Conventions
• Nouns for entity types and attributes.
• Verbs for relationships.
• Links between entities and relationships can be named with the "role" being played by
the entity in that relationship.
o A student entity plays the role of teaching assistant for a course.
o A faculty entity plays the role of advisor for a student.
Pathways
• Identify and give names to the "things" being modeled, i.e., identify the entities.
• List the attributes for the entities.
• Based on the semantics, list the candidate keys and select a primary key (except for weak
entities).
• Discover and give active descriptions to relationships between entities.
• Find "identifying relationships" for weak entities by looking for "owner" entity, situations
where one entity type is information about another entity type.
• Finish off some details: look for total participation, cardinality, role names, etc.
2.12 ER Examples

A Country Bus Company owns a number of buses. Each bus is allocated to a particular route,
although some routes may have several buses. Each route passes through a number of towns.
One or more drivers are allocated to each stage of a route, which corresponds to a journey
through some or all of the towns on a route. Some of the towns have a garage where buses are
kept, and each bus is identified by its registration number and can carry different
numbers of passengers, since the vehicles vary in size and can be single or double-decked. Each
route is identified by a route number and information is available on the average number of
passengers carried per day for each route. Drivers have an employee number, name, address, and
sometimes a telephone number.

Entities

• Bus - the company owns buses and will hold information about them.
• Route - buses travel on routes, which will need to be described.
• Town - buses pass through towns, and we need to know about them.
• Driver - the company employs drivers; personnel will hold their data.
• Stage - routes are made up of stages.
• Garage - a garage houses buses, and we need to know where they are.

Relationships

• A bus is allocated to a route and a route may have several buses.


• Bus-route (m:1) is serviced by

• A route comprises one or more stages.
• route-stage (1:m) comprises
• One or more drivers are allocated to each stage.
• driver-stage (m:1) is allocated
• A stage passes through some or all of the towns on a route.
• stage-town (m:n) passes-through
• A route passes through some or all of the towns
• route-town (m:n) passes-through
• Some of the towns have a garage
• garage-town (1:1) is situated
• A garage keeps buses and each bus has one `home' garage
• garage-bus (m:1) is garaged

Draw E-R Diagram

Figure : Bus Company


Attributes

• Bus (reg-no,make,size,deck,no-pass)
• Route (route-no,avg-pass)
• Driver (emp-no,name,address,tel-no)
• Town (name)
• Stage (stage-no)
• Garage (name,address)

2.13 Problems with ER Models

There are several problems that may arise when designing a conceptual data model. These are
known as connection traps.

There are two main types of connection traps:

1. fan traps

2. chasm traps

2.13.1 Fan traps


A fan trap occurs when a model represents a relationship between entity types, but the pathway
between certain entity occurrences is ambiguous. It occurs when 1:m relationships fan out from a
single entity.

A single site contains many departments and employs many staff. However, which staff work in
a particular department?

The fan trap is resolved by restructuring the original ER model to represent the correct
association.

2.14 Chasm traps

A chasm trap occurs when a model suggests the existence of a relationship between entity types,
but the pathway does not exist between certain entity occurrences.

It occurs where there is a relationship with partial participation, which forms part of the pathway
between entities that are related.

• A single branch is allocated many staff who oversee the management of properties for
rent. Not all staff oversee property and not all property is managed by a member of staff.
• What properties are available at a branch?
• The partial participation of Staff and Property in the oversees relation means that some
properties cannot be associated with a branch office through a member of staff.
• We need to add the missing relationship which is called `has' between the Branch and the
Property entities.
• You need to therefore be careful when you remove relationships which you consider to
be redundant.

2.15 Enhanced ER Models (EER)

The basic concepts of ER modelling are not powerful enough for some complex applications. We
require some additional semantic modelling concepts:

• Specialisation
• Generalisation
• Categorisation
• Aggregation

First we need some new entity constructs.

• Superclass - an entity type that includes distinct subclasses that need to be represented
in a data model.
• Subclass - an entity type that has a distinct role and is also a member of a superclass.

Subclasses need not be mutually exclusive; a member of staff may be a manager and a sales
person.

The purpose of introducing superclasses and subclasses is to avoid describing types of staff with
possibly different attributes within a single entity. This could waste space and you might want to
make some attributes mandatory for some types of staff but other staff would not need these
attributes at all.

2.15.1 Key Constraints

Example: Key constraints on manages


The restriction that each department has at most one manager is an example of a key constraint.
• Each department entity appears in at most one manages relationship in any allowable
instance of manages.
• The restriction is indicated on the ER diagram using an arrow.

2.15.2 Participation Constraints

A participation constraint specifies whether the existence of an entity depends on its being
related to another entity via the relationship type.
E.g. a requirement that each department should have a manager.
The participation of the entity set Departments in the relationship set Manages is then said to be
total, meaning that every department entity must be related to an employee entity via Manages.
A participation that is not total is said to be partial; for example, only some employee entities
are related to a department entity via Manages (only some employees are managers).
2.15.3 Weak Entities

These are entities that do not have key attributes of their own.

In contrast regular (strong) entities have key attributes.


Entities belonging to a weak entity type are identified by being related to specific entities from
another entity type in combination with some of their attribute values.
This other entity type is called the identifying or owner entity type.
The relationship type that relates weak entity type to its owner is called identifying relationship.
A weak entity type always has a total participation constraint.


2.15.4 Aggregation
• One limitation of the ER model is that it cannot express relationships among relationships.
• Aggregation is an abstraction through which relationships are treated as higher-level entities.
o It allows us to indicate that a relationship set (identified through a dashed box) participates
in another relationship set.

We use aggregation when we want to express a relationship among relationships.

2.15.5 Class Hierarchies


• Sometimes it is natural to classify the entities in an entity set into subclasses.
o E.g. consider Hourly_Emps and Contract_Emps entity sets.
• We say that the attributes of the entity set Employees are inherited by the entities of
Hourly_Emps, and that Hourly_Emps ISA Employees.
• A class hierarchy can be viewed in one of two ways:
o Employees is specialised into subclasses. Specialisation is the process of identifying
subsets of an entity set (the superclass) that share some distinguishing characteristics.
Typically the superclass is defined first, followed by the subclasses.
o Hourly_Emps and Contract_Emps are generalised by Employees.
• We can specify two constraints with respect to ISA hierarchies:
o Overlap constraints determine whether two subclasses are allowed to contain the same entity.
o Covering constraints determine whether the entities in the subclasses collectively include all
entities in the superclass.

2.15.6 Specialisation

This is the process of maximising the differences between members of an entity by identifying
their distinguishing characteristics.

• Staff(staff_no,name,address,dob)
• Manager(bonus)
• Secretary(wp_skills)
• Sales_personnel(sales_area, car_allowance)

• Here we have shown that the manages relationship is only applicable to the Manager
subclass, whereas the works_for relationship is applicable to all staff.
• It is possible to have subclasses of subclasses.
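
One common way of realising such a specialisation hierarchy is to give each subclass its own table
that shares the superclass key. The sketch below is only illustrative; the column types and sizes
are assumptions:

CREATE TABLE staff (
    staff_no INTEGER PRIMARY KEY,
    name     VARCHAR(50),
    address  VARCHAR(100),
    dob      DATE
);

CREATE TABLE manager (
    staff_no INTEGER PRIMARY KEY,          -- same key as the superclass
    bonus    DECIMAL(8,2),
    FOREIGN KEY (staff_no) REFERENCES staff(staff_no)
);

CREATE TABLE sales_personnel (
    staff_no      INTEGER PRIMARY KEY,
    sales_area    VARCHAR(30),
    car_allowance DECIMAL(8,2),
    FOREIGN KEY (staff_no) REFERENCES staff(staff_no)
);

-- A secretary table holding wp_skills would follow the same pattern.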

2.15.7 Generalisation

Generalisation is the process of minimising the differences between entities by identifying
common features. This is the identification of a generalised superclass from the original
subclasses, i.e. the process of identifying the common attributes and relationships.

For instance, taking:

car(regno,colour,make,model,numSeats)
motorbike(regno,colour,make,model,hasWindshield)

And forming:

vehicle(regno,colour,make,model,numSeats,hasWindshield)

In this case vehicle has numSeats which would be NULL if the vehicle was a motorbike, and has
hasWindshield which would be NULL if it was a car.
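
Expressed as a rough SQL sketch (the column types are assumptions), the generalised superclass
becomes a single table in which the subclass-specific attributes are simply left nullable:

CREATE TABLE vehicle (
    regno         VARCHAR(10) PRIMARY KEY,
    colour        VARCHAR(20),
    make          VARCHAR(20),
    model         VARCHAR(20),
    numSeats      INTEGER,     -- NULL when the vehicle is a motorbike
    hasWindshield CHAR(1)      -- NULL when the vehicle is a car
);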

3 Structured Query Language
In the other chapters of this course consideration is given to producing a good design for a
database structure or schema. In this chapter the focus is on applying this schema to a database
management system, and then using that DBMS to allow storage and retrieval of data.
To communicate with the database system itself we need a language. We use SQL for
illustration; most DBMSs (such as Oracle or Access) support it. SQL is an international
standard language for manipulating relational databases. It is based on an IBM product. SQL is
short for Structured Query Language.
SQL can create schemas, delete them, and change them. It can also put data into schemas and
remove data. It is a data handling language, but it is not a programming language.

SQL is a DSL (Data Sub Language), which is really a combination of two languages. These are
the Data Definition Language (DDL) and the Data Manipulation Language (DML). Schema
changes are part of the DDL, while data changes are part of the DML. We will consider both
parts of the DSL in this discussion of SQL.

3.1 Database Models


A data model can be defined as a set of constructs for defining and manipulating a database.
A data model comprises
• a data structure
• a set of integrity constraints
• operations associated with the data structure

Examples of data models include:

• hierarchic
• network
• relational

Models other than the relational model used to be quite popular. Each model type is
appropriate to particular types of problem. The relational model is the most popular in use
today, and the other types are not discussed further.

3.2 Relational Databases


The relational data model is the primary data model for commercial data processing

The relational data model comprises:

• relational data structure


• relational integrity constraints
• relational algebra or equivalent (SQL)

SQL is an ISO standard language based on relational algebra. Relational algebra is a mathematical
formulation.
3.3 Relational Data Structure
A relational data structure is a collection of tables or relations.
• Each relation is a table with rows and columns; thus, a relation is a collection of rows or
tuples.
• This tabular representation is simple and permits the use of queries over the data.
• A tuple (row) is a collection of attribute (column) values.
• A domain is a pool of values from which the actual attribute values are taken.

3.4 Domain and Integrity Constraints


• Domain Constraints
o limit the range of domain values of an attribute
o specify uniqueness and `nullness' of an attribute
o specify a default value for an attribute when no value is provided.
• Entity Integrity
o every tuple is uniquely identified by a unique non-null attribute, the primary key.
• Referential Integrity
o rows in different tables are correctly related by valid key values (`foreign' keys
refer to primary keys).

3.5 Structure of a Table


In the design process tables are defined, and the relationships between tables identified.
Remember a relationship is just a link between two concepts.

• A database consists of a collection of tables, each of which is assigned a unique name.


• Example: consider the Accounts table below.

The Account relation:
Account number   Branch name   Balance
A-101 Downtown 500
A-102 Westlands 400
A-201 Parklands 900
A-215 Meru 750
A-217 Kisii 700
A-222 Embu 350
• A row in a table represents a relationship among a set of values.
• A table is a collection of such relationships.
• We follow the terminology of relational models.
• Table headers are referred to as attributes.
• For each attribute there is a set of permitted values called the domain of that attribute.

• Because tables are relations, we use the mathematical terms relation and tuple in place of
table and row.
• For all relations, the domains of all attributes should be atomic, i.e. elements of the
domain are considered indivisible.
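
As a small illustration (the column types below are assumptions, chosen only to suggest suitable
domains), the Accounts relation above could be defined and populated in SQL as:

CREATE TABLE account (
    account_number CHAR(5)       PRIMARY KEY,   -- domain: account codes such as 'A-101'
    branch_name    VARCHAR(30)   NOT NULL,      -- domain: branch names
    balance        DECIMAL(10,2)                -- domain: monetary amounts
);

INSERT INTO account VALUES ('A-101', 'Downtown', 500);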

Consider another example of a table holding "drivers" and a table holding "car" information...
Each car is owned by a driver, and therefore there is a link between "car" and "driver" to indicate
which driver owns which car.

In the subsequent pages we will refer back to this driver and car arrangement. To make the
examples easier, let's create some example data.

Car
The CAR table has the following structure:

• REGNO : The registration number of the car


• MAKE : The manufacturer of the car
• COLOUR: The colour of the car
• PRICE : The price of the car when it was bought new

DRIVER
The DRIVER table has the following structure:

• NAME : The full name of the driver


• DOB : The date of birth of the driver

Relationship between CAR and DRIVER

The DRIVER and the CAR tables have a relationship between them of N:1. This indicates that a CAR
can have only 1 DRIVER, but that a DRIVER can own more than 1 CAR simultaneously.

In the design section we can see that this requires a FOREIGN KEY in the CAR end of the
relationship. This foreign key allows us to implement the relationship in the database. We will
call this field OWNER.
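
The two tables could therefore be created in SQL along the following lines (a sketch only; the
column types and sizes are assumptions):

CREATE TABLE driver (
    name VARCHAR(50) PRIMARY KEY,
    dob  DATE
);

CREATE TABLE car (
    regno  VARCHAR(10) PRIMARY KEY,
    make   VARCHAR(20),
    colour VARCHAR(20),
    price  DECIMAL(10,2),
    owner  VARCHAR(50),    -- foreign key to DRIVER; NULL means the car has no owner yet
    FOREIGN KEY (owner) REFERENCES driver(name)
);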

Example Data: DRIVER

NAME            DOB
Jim Smith       11 Jan 1980
Bob Smith       23 Mar 1981
Bob Jones       3 Dec 1986

CAR
REGNO MAKE COLOUR PRICE OWNER
F611 AAA FORD RED 12000 Jim Smith
J111 BBB SKODA BLUE 11000 Jim Smith
A155 BDE MERCEDES BLUE 22000 Bob Smith
K555 GHT FIAT GREEN 6000 Bob Jones
SC04 BFE SMART BLUE 13000

3.5.1 Columns or Attributes

Each column is given a name which is unique within a table

Each column holds data of one specified type. E.g.

• integer
• decimal
• character
• text data
The range of values can be further constrained.

If a column of a row contains no data, we say it is NULL. For example, a car just off the
production line might not have an owner in the database until someone buys the car. A NULL
value may also indicate that the value is unavailable or inappropriate. This might be the case for
a car which is being destroyed or a car where two people are arguing in court that they are both
the owner.

3.5.2 Basic Structure:


• A relation consists of a relation schema and a relation instance
• The relation instance is a table and the relation schema describes the column heads for
the table.
• Schema describes the relation’s name, the name of each field (or column or attribute),
and the domain of each field.
• An instance of a relation is a set of tuples (also called records) in which each tuple has the
same number of fields as the schema.

Some important rules:

• All rows of a table must be different in some way from all other rows.
• Sometimes a row is referred to as a Tuple.
• Cardinality is the number of ROWS in a table.
• Arity is the number of COLUMNS in a table.


Relation Schema: Notations


• Relational schema: R(A1, A2, ..., An)
• R is the relation name.
• A1, A2, ..., An is the list of attributes.
• Convention: lowercase names for relations, and names beginning with upper case for relation
schemas.
• E.g. Account-schema = (account-number, branch-name, balance)
• Each attribute is the name of a role played by some domain D in the relation schema R.
• An n-tuple in a relation r(R) is denoted by t = <v1, v2, ..., vn>
o vi is the value corresponding to attribute Ai.
• Notations for component values of tuples:
o t[Ai] or t.Ai denotes the value vi in t for attribute Ai.
• The letters Q, R, S denote relation names.
• The letters q, r, s denote relation states.
• The letters t, u, v denote tuples.
3.5.3 Characteristics of Relations
There are several characteristics that make relations different from a file or simple table:
• Ordering of tuples in a relation:
o A relation is defined as a set of tuples, and mathematically the elements of a set have no
order, so tuple ordering is not part of the relation definition.
o In files, by contrast, records are ordered.
• Ordering of values within a tuple:
o The initial definition of a relation says an n-tuple is an ordered list of n values.
o An alternative definition treats the relation schema as a set of attributes.
• Values in the tuples: each value is atomic, hence composite and multivalued attributes are not
allowed.
• Interpretation of a relation: the relational schema can be interpreted as a declaration or a type
of assertion.
o For example, the schema of the Student relation asserts that, in general, a student entity has
a name, sid, age, address, and so on.

3.6 Primary Keys

A table requires a key which uniquely identifies each row in the table. This is entity integrity.

The key could have one column, or it could use all the columns. It should not use more columns
than necessary. A key with more than one column is called a composite key.

A table may have several possible keys, the candidate keys, from which one is chosen as the
primary key.

No part of a primary key may be NULL.

If the rows of the data are not unique, it is necessary to generate an artificial primary key.

In our example, DRIVER has a primary key of NAME, and CAR has a primary key of REGNO.
This database will break if there are two drivers with the same name, but it gives you an idea of
what the primary key means.

3.7 Integrity Constraints over Relations

• An integrity constraint (IC) is a condition that is specified on a database schema and restricts
the data that can be stored in an instance of the database.
• Legal instance: a database instance that satisfies all the ICs.
• A DBMS enforces ICs in that it permits only legal instances to be stored in the database.

When are ICs specified and checked?

• ICs are specified when the DBA or end user defines a database schema.
• When a database application is run, the DBMS checks for violations and disallows changes to
the data that violate the ICs.

3.7.1 Kinds of Constraints

• Domain constraints
• Key constraints
• Foreign key constraints
• General constraints

Domain constraints specify a condition that we want each instance of a relation to satisfy: each
attribute A must be atomic, and the values that appear in a column must be drawn from the
domain associated with that column.

• Primary key constraints: a statement that a certain minimal subset of the fields of a relation is
a unique identifier for a tuple.

E.g. for a Student relation there should be a constraint that no two students have the same
student ID.

• Candidate key (key): a set of fields that uniquely identifies a tuple according to a key
constraint.
• By convention, the attributes that form the primary key of a relation schema are underlined.
• A relation may have several candidate keys.
• E.g. {sid, name} and {login, sid} in the STUDENT relation could both be candidate keys.
• The keys must identify the tuples uniquely in all possible instances of the relation.

Foreign key constraints
• Sometimes, information stored in a relation is linked to information stored in another relation.
• If one relation is modified, the other one should be checked for consistency.

General constraints
• More general constraints are needed to prevent data entry errors.
• E.g. we may require age to be within some range.
• These may be thought of as extended domain constraints.
• In general, constraints that go beyond domain, key and foreign key constraints can be specified.
• They are supported by current relational databases via table constraints and assertions.
o Table constraints are associated with a single table and are checked whenever that table is
modified.
o Assertions involve several tables and are checked whenever any of these tables is modified.
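
For instance, a table constraint restricting age to a range could be written as follows (a sketch
with illustrative column names; the exact syntax varies slightly between DBMSs):

CREATE TABLE student (
    sid   INTEGER PRIMARY KEY,
    name  VARCHAR(50) NOT NULL,
    login VARCHAR(20) UNIQUE,
    age   INTEGER CHECK (age BETWEEN 16 AND 80)   -- general (table) constraint on age
);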

3.8 SQL Basics

Basic SQL Statements include:

• CREATE - a data structure


• SELECT - read one or more rows from a table
• INSERT - one or more rows into a table
• DELETE - one or more rows from a table
• UPDATE - change the column values in a row
• DROP - a data structure

In the remainder of this section only simple SELECT statements are considered.

Simple SELECT

The syntax of a SELECT statement is :

SELECT column FROM tablename

This would produce all the rows from the specified table, but only for the particular column
mentioned. If you want more than one column shown, you can put in multiple columns
separating them with commas, like:

SELECT column1,column2,column3 FROM tablename

If you want to see all the columns of a particular table, you can type:

SELECT * FROM tablename

Let's see it in action on CAR...

SELECT * FROM car;


REGNO MAKE COLOUR PRICE OWNER
F611 AAA FORD RED 12000 Jim Smith
J111 BBB SKODA BLUE 11000 Jim Smith
A155 BDE MERCEDES BLUE 22000 Bob Smith
K555 GHT FIAT GREEN 6000 Bob Jones
SC04 BFE SMART BLUE 13000

SELECT regno FROM car;

REGNO
F611 AAA
J111 BBB
A155 BDE
K555 GHT
SC04 BFE

SELECT colour,owner FROM car;

COLOUR OWNER
RED Jim Smith
BLUE Jim Smith
BLUE Bob Smith
GREEN Bob Jones
BLUE

In SQL, you can put extra space characters and return characters just about anywhere without
changing the meaning of the SQL. SQL is also case-insensitive (except for things in quotes). In
addition, SQL in theory should always end with a ';' character. You need to include the ';' if you
have two different SQL queries so that the system can tell when one SQL statement stops and
another one starts. If you forget the ';' the online interface will put one in for you. For these
reasons all of the following statements are identical and valid.

SELECT REGNO FROM CAR;

SELECT REGNO FROM CAR

Select REGNO from CAR

select regno FROM car

SELECT
regno
FROM car;

Comments
Sometimes you might want to write a comment in somewhere as part of an SQL statement. A
comment in this case is a simple piece of text which is meaningful to yourself, but should be
ignored by the database. The characters '--', when they appear in a query, indicate the start of a
comment. Everything after that point is ignored until the end of that line. The following queries
are all equivalent.
SELECT regno
FROM car;

SELECT regno -- The registration number


FROM car -- The car storage table
;

Warning: You cannot put a comment immediately after a ';'. Comments are only supported
within the text of an SQL statement. The following will cause SQL errors:

SELECT regno
FROM car; -- Error here as comment is after the query

-- Error here as comment is before the start of the query


SELECT regno
FROM car;

SELECT filters
Displaying all the rows of a table can be handy, but if we have tables with millions of rows then
this type of query could take hours. Instead, we can add "filters" onto a SELECT statement to
only show specific rows of a table. These filters are written into an optional part of the SELECT
statement, known as a WHERE clause.
SELECT columns
FROM table
WHERE rule

The "rule" section of the WHERE clause is checked for every row that a select statement would
normally show. If the whole rule is TRUE, then that row is shown, whereas if the rule is FALSE,
then that row is not shown.

The rule itself can be quite complex. The simplest rule is a single equality test, such as
"COLOUR = 'RED'".

Without the WHERE clause the query would show:

SELECT regno from CAR;


REGNO
F611 AAA
J111 BBB
A155 BDE
K555 GHT
SC04 BFE

From the database we know that only F611 AAA is RED, and the rest of the cars are either
BLUE or GREEN. Thus a rule COLOUR = 'RED' is only true on the row with F611 AAA, and
false elsewhere. With everything in a query:

SELECT regno from CAR


WHERE colour = 'RED';
REGNO
F611 AAA

An important point to note is that queries are case sensitive between the quotes. Thus 'RED' will
work, but 'red' will produce nothing. The case used in the quotes must match perfectly the case
stored in the table. SQL is not forgiving, and if you forget this you can be scratching your head
for hours trying to fix it.

Note also that "colour" does not have to appear on the SELECT line as a column name. It can if
you want to see the colour, but there is no requirement for it to be there. Therefore this will work
too:

SELECT regno,colour from CAR


WHERE colour = 'RED';

REGNO COLOUR
F611 AAA RED

Comparisons

SQL supports a variety of comparison rules for use in a WHERE clause. These include =,!=,<>,
<, <=, >, and >=.

Examples of a single rule using these comparisons are:

WHERE colour = 'RED' The colour attribute must be RED


WHERE colour != 'RED' The colour must be a colour OTHER THAN RED
WHERE colour <> 'RED' The same as !=
WHERE PRICE > 10000 The price of the car is MORE THAN 10000
WHERE PRICE >= 10000 The price of the car is EQUAL TO OR MORE THAN 10000
WHERE PRICE < 10000 The price of the car is LESS THAN 10000
WHERE PRICE <= 10000 The price of the car is EQUAL TO OR LESS THAN 10000

Note that when dealing with strings, like RED, you must say 'RED'. When dealing with numbers,
like 10000, you can say '10000' or 10000. The choice is yours.

Dates

Date rules are some of the hardest rules to get right in writing SQL, yet there is nothing
particularly complex about them. The hard part is working out what it means to be GREATER
THAN a particular date.

In date calculations, you can use all the normal comparators.

SELECT name,dob from driver

NAME DOB
Jim Smith 11 Jan 1980
Bob Smith 23 Mar 1981
Bob Jones 3 Dec 1986

SELECT name,dob from driver


WHERE DOB = '3 Dec 1986'

NAME DOB
Bob Jones 3 Dec 1986

In other comparators, it is important to realise that a date gets bigger as you move into the future,
and smaller as you move into the past. Thus to say 'DATE1 < DATE2' you are stating that
DATE1 occurs before DATE2 on a calendar. For example, to find all drivers who were born on
or after the 1st Jan 1981 you would do:

SELECT name,dob from driver


WHERE DOB >= '1 Jan 1981'

NAME DOB

Bob Smith 23 Mar 1981
Bob Jones 3 Dec 1986

The syntax for dates does change slightly on different database systems, but the syntax '1 Jan
2000' works in general on all systems. Oracle also allows dates like '1-Jan-2000' and '1-Jan-00'.
If you specify a year using only the last two digits, Oracle uses the current date to compute the
missing parts of the year, converting '00' to '2000'. Do not get confused by saying '87' for '1987'
and ending up with '2087'!

BETWEEN

Sometimes when you are dealing with dates you want to specify a range of dates to check. The
best way of doing this is using BETWEEN. For instance, to find all the drivers born between
1985 and 1999 you could use:

SELECT name,dob from driver


WHERE DOB between '1 Jan 1985' and '31 Dec 1999'

NAME DOB
Bob Jones 3 Dec 1986

Note that the dates have day of the month and month in them, and not just the year. In SQL, all
dates must have a month and a year. If you try to use just a year the query will fail.

BETWEEN works for other things, not just dates. For instance, to find cars worth between 5000
and 10000, you could execute:

SELECT regno
FROM car
where price between 5000 and 10000;

REGNO
K555 GHT

NULL

The NULL value indicates that something has no real value. For this reason the normal value
comparisons will always fail if you are dealing with a NULL. If you are looking for NULL, for
instance looking for cars without owners using OWNER of CAR, all of the following are wrong!

SELECT regno from CAR WHERE OWNER = NULL WRONG!


SELECT regno from CAR WHERE OWNER = 'NULL' WRONG!

Instead SQL has a special comparison operator called IS which allows us to find NULL values.
There is also an opposite to IS, called IS NOT, which finds all the values which are not NULL.

So finding all the regnos of cars with current owners would be (note that if they have an owner,
then the owner has a value and thus is NOT NULL):

SELECT REGNO from CAR


WHERE OWNER is not NULL

REGNO
F611 AAA
J111 BBB
A155 BDE
K555 GHT

And finding cars without owners would be:

SELECT REGNO from CAR


WHERE OWNER is NULL
REGNO
SC04 BFE

LIKE

When dealing with strings, sometimes you do not want to match on exact strings like ='RED',
but instead on partial strings, substrings, or particular patterns. This could allow you, for
instance, to find all cars with a colour starting with 'B'. The LIKE operator provides this
functionality.

The LIKE operator is used in place of an '=' sign. In its basic form it is identical to '='. For
instance, both of the following statements are identical:

SELECT regno FROM car WHERE colour = 'RED';


SELECT regno FROM car WHERE colour LIKE 'RED';

The power of LIKE is that it supports two special characters, '%' and '_'. These are equivalent to
the '*' and '?' wildcard characters of DOS. Wherever there is a '_' character in the string, exactly
one character (of any value) will match. Wherever there is a '%' character in the string, 0 or more
characters will match. Consider these rules:

name LIKE 'Jim Smith' Matches 'Jim Smith'


name LIKE '_im Smith' Matches things like 'Jim Smith' or 'Tim Smith'
name LIKE '___ Smith' Matches 'Jim Smith' and 'Bob Smith'
name LIKE '% Smith' Matches 'Jim Smith' and 'Bob Smith'
name LIKE '% S%' Matches 'Jim Smith' and 'Bob Smith'
name LIKE 'Bob %' Matches 'Bob Jones' and 'Bob Smith'
name LIKE '%' Matches anything not null

Note however that LIKE is more powerful than a simple '=' operator, and thus takes longer to
run. If you are not using any wildcard characters in a LIKE operator then you should always
replace LIKE with '='.

4 DATABASE DESIGN
In this chapter, we shall consider:

• Schema Refinement
• Functional Dependencies
• Normal Forms
• Decompositions
• Normalization
• Other Types of Functional Dependencies
4.1 Schema Refinement
Conceptual database design gives a set of relational schemas and integrity constraints (ICs). This
initial design must be refined by taking the ICs into account more carefully.
These constraints include:
•Functional Dependencies (Most important)
•Multivalued Dependencies
•Join Dependencies
The main problem that schema refinement is intended to address is redundancy.
• One method of eliminating redundancy is decomposition.
• Decomposition, however, also has some problems of its own.
4.1.1 Problems Caused by Redundancy
• Redundant storage: some information is stored repeatedly.
• Update anomalies: if one copy of such repeated data is updated, an inconsistency is created.
• Insertion anomalies: it may not be possible to store some information unless some other
information is stored as well.
• Deletion anomalies: it may not be possible to delete some information without losing some
other information as well.
An example: the Hourly_Emps relation
• Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)
• Key: ssn
Suppose that the hourly_wages attribute is determined by rating, i.e. for a given rating value
there is only one hourly_wages value.
This is an example of a functional dependency. It leads to possible redundancy in the
Hourly_Emps relation.

Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)

Redundancy: if the same rating value appears in two tuples, the same hourly_wages value must
appear in both.
Consequences:
• Some information is stored multiple times, leading to wasted space and possible inconsistency.
• We cannot insert a tuple for an employee unless we know the hourly wage for the employee's
rating value (an insertion anomaly).
• If we delete all tuples with a given rating value, we lose the association between rating and
hourly_wages (a deletion anomaly).

4.1.2 Use of Decomposition


We briefly discuss decomposition in this section; more detail follows in the next section.
Redundancy can be addressed by replacing a relation with a collection of smaller relations, a
process called decomposition:
Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)
Hourly_Emps2(ssn, name, lot, rating, hours_worked)
Wages(rating, hourly_wages)
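
In SQL, the decomposed schema could be created roughly as follows (a sketch; the column types are
assumptions), and the original relation can then be recovered with a join:

CREATE TABLE Wages (
    rating       INTEGER PRIMARY KEY,
    hourly_wages DECIMAL(6,2)
);

CREATE TABLE Hourly_Emps2 (
    ssn          CHAR(11) PRIMARY KEY,
    name         VARCHAR(50),
    lot          INTEGER,
    rating       INTEGER,
    hours_worked DECIMAL(5,1),
    FOREIGN KEY (rating) REFERENCES Wages(rating)
);

-- Recovering the original Hourly_Emps relation (a lossless join):
SELECT e.ssn, e.name, e.lot, e.rating, w.hourly_wages, e.hours_worked
FROM   Hourly_Emps2 e, Wages w
WHERE  e.rating = w.rating;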

Do we need to decompose a relation?

If a schema is in one of the normal forms, then we know that certain kinds of problems cannot
arise. Considering the normal form of a relation can help us decide whether or not to decompose it.

Properties of decomposition
• Lossless join: enables us to recover any instance of the original relation from the
corresponding instances of the smaller relations.
• Dependency preservation: enables us to enforce a constraint on the original relation by
simply enforcing constraints on the smaller relations.

Problems of Decomposition
Queries over the original relation may require us to join the decomposed relations.

4.1.3 Informal Design Guidelines for Relational Schemas


• Semantics of the relation attributes
Design a relation schema so that it is easy to explain its meaning. Do not combine attributes
from multiple entity types into a single relation.
• Redundant information in tuples and update anomalies
Design the base relation schemas so that no insertion, deletion or modification anomalies are
present in the relations.
• Null values in tuples
As far as possible, avoid placing attributes in a base relation whose values may frequently be
null. This matters because nulls can have several interpretations: the attribute does not apply to
this tuple, the value is unknown, or the value is known but absent.
• Generation of spurious tuples
Design relation schemas so that they can be joined with equality conditions on attributes that are
either primary keys or foreign keys in a way that guarantees that no spurious tuples are
generated.
4.2 Functional Dependencies

A functional dependency (FD) is a kind of IC between two sets of attributes that generalises the
concept of a key.

4.3 Normalization
What is normalisation?
Normalisation is the process of taking data from a problem and reducing it to a set of relations
while ensuring data integrity and eliminating data redundancy.

• Data integrity - all of the data in the database are consistent, and satisfy all integrity
constraints.
• Data redundancy – if data in the database can be found in two different locations (direct
redundancy) or if data can be calculated from other data items (indirect redundancy) then
the data is said to contain redundancy.

Data should only be stored once, and you should avoid storing data that can be calculated from
other data already held in the database. During the process of normalisation redundancy must be
removed, but not at the expense of breaking data integrity rules.
If redundancy exists in the database then problems can arise when the database is in normal
operation:

• When data is inserted, the data must be duplicated correctly in all places where there is
redundancy. For instance, if two tables in a database both contain the employee name, then
creating a new employee entry requires that both tables be updated with the employee
name.
• When data is modified in the database, if the data being changed has redundancy, then all
versions of the redundant data must be updated simultaneously. So in the employee
example a change to the employee name must happen in both tables simultaneously.

The removal of redundancy helps to prevent insertion, deletion, and update errors, since the data
is only available in one attribute of one table in the database.
The data in the database can be considered to be in one of a number of `normal forms'. Basically
the normal form of the data indicates how much redundancy is in that data. The normal forms
have a strict ordering:

• 1st Normal Form


• 2nd Normal Form
• 3rd Normal Form
• BCNF

There are other normal forms, such as 4th and 5th normal forms. They are rarely utilised in
system design and are not considered further here.
To be in a particular form requires that the data meets the criteria to also be in all normal forms
before that form. Thus to be in 2nd normal form the data must meet the criteria for both 2nd
normal form and 1st normal form. The higher the form the more redundancy has been eliminated.

4.4 Integrity Constraints
An integrity constraint is a rule that restricts the values that may be present in the database. The
relational data model includes constraints that are used to verify the validity of the data as well as
adding meaningful structure to it:

• Entity integrity:

The rows (or tuples) in a relation represent entities, and each one must be uniquely identified.
Hence we have the primary key that must have a unique non-null value for each row.

• Referential integrity:

This constraint involves the foreign keys. Foreign keys tie the relations together, so it is vitally
important that the links are correct. Every foreign key must either be null or its value must be the
actual value of a key in another relation.
4.5 Understanding Data
Sometimes the starting point for understanding data is given in the form of relations and
functional dependencies. This would be the case where the starting point in the process was a
detailed specification of the problem. We already know what relations are. Functional
dependencies are rules stating that given a certain set of attributes (the determinant) determines a
second set of attributes.
The definition of a functional dependency looks like A->B. In this case B is a single attribute but
it can be as many attributes as required (for instance, X->J,K,L,M). In the functional
dependency, the determinant (the left hand side of the -> sign) can determine the set of attributes
on the right hand side of the -> sign. This basically means that A selects a particular value for B,
and that A is unique. In the second example X is unique and selects a particular set of values for
J,K,L, and M. It can also be said that B is functionally dependent on A. In addition, a particular
value of A ALWAYS gives you a particular value for B, but not vice-versa.
Consider this example:
R(matric_no, firstname, surname, tutor_number, tutor_name)
tutor_number -> tutor_name
Here there is a relation R, and a functional dependency that indicates that:

• instances of tutor_number are unique in the data


• from the data, given a tutor_number, it is always possible to work out the tutor_name.
• As an example tutor number 1 may be “Mr Smith”, but tutor number 10 may also be “Mr
Smith”. Given a tutor number of 1, this is ALWAYS “Mr Smith”. However, given the
name “Mr Smith” it is not possible to work out if we are talking about tutor 1 or tutor 10.

There is actually a second functional dependency for this relation, which can be worked out from
the relation itself. As the relation has a primary key, then given this attribute you can determine
all the other attributes in R. This is an implied functional dependency and is not normally listed
in the list of functional dependents.

Extracting understanding
It is possible that the relations and the determinants have not yet been defined for a problem, and
therefore must be calculated from examples of the data. Consider the following Student table.

Student - an unnormalised table with repeating groups

matric_no Name date_of_birth subject grade


960100 Smith, J 14/11/1977 Databases C
Soft_Dev A
ISDE D
960105 White, A 10/05/1975 Soft_Dev B
ISDE B
960120 Moore, T 11/03/1970 Databases A
Soft_Dev B
Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
Soft_Dev D
ISDE C
Workshop D
The subject/grade pair is repeated for each student. 960145 has 1 pair while 960150 has four.
Repeating groups are placed inside another set of parentheses. From the table the following
relation is generated:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
The repeating group needs a key in order that the relation can be correctly defined. Looking at
the data one can see that grade repeats within matric_no (for instance, for 960150, the student
has 2 D grades). However, subject never seems to repeat for a single matric_no, and therefore is
a candidate key in the repeating group.
Whenever keys or dependencies are extracted from example data, the information extracted is
only as good as the data sample examined. It could be that another data sample disproves some
of the key selections made or dependencies extracted. What is important however is that the
information extracted during these exercises is correct for the data being examined.
Looking at the data itself, we can see that the same name appears more than once in the name
column. The name in conjunction with the date_of_birth seems to be unique, suggesting a
functional dependency of:
name, date_of_birth -> matric_no
This implies that not only is the matric_no sufficient to uniquely identify a student, the student’s
name combined with the date of birth is also sufficient to uniquely identify a student. It is
therefore possible to have the relation Student written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
As guidance, in cases where a variety of keys could be selected, one should try to select the
candidate key with the least number of attributes as the primary key.

Flattened Tables
Note that the student table shown above explicitly identifies the repeating group. It is also
possible that the table presented will be what is called a flat table, where the repeating group is
not explicitly shown:
Student #2 - Flattened Table

matric_no name date_of_birth Subject grade


960100 Smith, J 14/11/1977 Databases C
960100 Smith, J 14/11/1977 Soft_Dev A
960100 Smith, J 14/11/1977 ISDE D
960105 White, A 10/05/1975 Soft_Dev B
960105 White, A 10/05/1975 ISDE B
960120 Moore, T 11/03/1970 Databases A
960120 Moore, T 11/03/1970 Soft_Dev B
960120 Moore, T 11/03/1970 Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
960150 Black, D 21/08/1973 Soft_Dev D
960150 Black, D 21/08/1973 ISDE C
960150 Black, D 21/08/1973 Workshop D

The table still shows the same data as the previous example, but the format is different. We have
removed the repeating group (which is good) but we have introduced redundancy (which is bad).
Sometimes you will miss spotting the repeating group, so you may produce something like the
following relation for the Student data.
Student(matric_no, name, date_of_birth, subject, grade )
matric_no -> name, date_of_birth
name, date_of_birth -> matric_no
This data does not explicitly identify the repeating group, but as you will see the result of the
normalisation process on this relation produces exactly the same relations as the normalisation of
the version that explicitly does have a repeating group.
4.5.1 First Normal Form

• First normal form (1NF) deals with the `shape' of the record type
• A relation is in 1NF if, and only if, it contains no repeating attributes or groups of
attributes.
• Example:
• The Student table with the repeating group is not in 1NF
• It has repeating groups, and it is called an `unnormalised table'.

Relational databases require that each row only has a single value per attribute, and so a
repeating group in a row is not allowed.
To remove the repeating group, one of two things can be done:

• either flatten the table and extend the key, or


• decompose the relation- leading to First Normal Form

Flatten table and Extend Primary Key

The Student table with the repeating group can be written as:

Student(matric_no, name, date_of_birth, ( subject, grade ) )

If the repeating group was flattened, as in the Student #2 data table, it would look something
like:

Student(matric_no, name, date_of_birth, subject, grade )

Although this is an improvement, we still have a problem: matric_no can no longer be the
primary key - it does not have a unique value for each row. So we have to find a new primary
key - in this case it has to be a compound key since no single attribute can uniquely identify a
row. The new primary key is a compound key (matric_no + subject).

We have now solved the repeating groups problem, but we have created other complications.
Every repetition of the matric_no, name, and date_of_birth is redundant and liable to produce
errors.

With the relation in its flattened form, strange anomalies appear in the system. Redundant data is
the main cause of insertion, deletion, and updating anomalies.

Insertion anomaly:
With the primary key including subject, we cannot enter a new student until they have at least
one subject to study. We are not allowed NULLs in the primary key so we must have an entry in
both matric_no and subject before we can create a new record.

• This is known as the insertion anomaly. It is difficult to insert new records into the
database.
• On a practical level, it also means that it is difficult to keep the data up to date.

Update anomaly
If the name of a student were changed, for example Smith, J. was changed to Green, J., this would
require not one change but many: one for every subject that Smith, J. studied.

Deletion anomaly
If all of the records for the `Databases' subject were deleted from the table, we would
inadvertently lose all of the information on the student with matric_no 960145. The same would
happen for any student who was studying only one subject if that subject was deleted. Again
this problem arises from the need to have a compound primary key.

4.6 Decomposing the relation

• The alternative approach is to split the table into two parts, one for the repeating groups
and one of the non-repeating groups.

• the primary key for the original relation is included in both of the new relations

Record
matric_no subject grade
960100 Databases C
960100 Soft_Dev A
960100 ISDE D
960105 Soft_Dev B
960105 ISDE B
... ... ...
960150 Workshop D
Student
matric_no name date_of_birth
960100 Smith,J 14/11/1977
960105 White,A 10/05/1975
960120 Moore,T 11/03/1970
960145 Smith,J 09/01/1972
960150 Black,D 21/08/1973

• We now have two relations, Student and Record.


• Student contains the original non-repeating groups
• Record has the original repeating groups and the matric_no

Student(matric_no, name, date_of_birth )


Record(matric_no, subject, grade )
Matric_no remains the key to the Student relation. It cannot be the complete key to the new
Record relation - we end up with a compound primary key consisting of matric_no and subject.
The matric_no is the link between the two tables - it will allow us to find out which subjects a
student is studying. So in the Record relation, matric_no is a foreign key.
This method has eliminated some of the anomalies, though it does not always do so; it depends
on the example chosen.

• In this case we no longer have the insertion anomaly


• It is now possible to enter new students without knowing the subjects that they will be
studying
• They will exist only in the Student table, and will not be entered in the Record table until
they are studying at least one subject.
• We have also removed the deletion anomaly
• If all of the `databases' subject records are removed, student 960145 still exists in the
Student table.
• We have also removed the update anomaly

Student and Record are now in First Normal Form.
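
As a hedged SQL sketch of these two 1NF relations (the column types and lengths are assumptions),
the compound key and the foreign key could be declared as:

CREATE TABLE Student (
    matric_no     CHAR(6) PRIMARY KEY,
    name          VARCHAR(50),
    date_of_birth DATE
);

CREATE TABLE Record (
    matric_no CHAR(6)     NOT NULL,
    subject   VARCHAR(20) NOT NULL,
    grade     CHAR(1),
    PRIMARY KEY (matric_no, subject),                       -- compound primary key
    FOREIGN KEY (matric_no) REFERENCES Student(matric_no)   -- link back to Student
);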
4.6.1 Second Normal Form
Second normal form (or 2NF) is a more stringent normal form defined as:
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully functionally
dependent on the whole key.

Thus the relation is in 1NF with no repeating groups, and all non-key attributes must depend on
the whole key, not just some part of it. Another way of saying this is that there must be no partial
key dependencies (PKDs).
The problems arise when there is a compound key, e.g. the key to the Record relation -
matric_no, subject. In this case it is possible for non-key attributes to depend on only part of the
key - i.e. on only one of the two key attributes. This is what 2NF tries to prevent.
Consider again the Student relation from the flattened Student #2 table:
Student(matric_no, name, date_of_birth, subject, grade )

• There are no repeating groups


• The relation is already in 1NF
• However, we have a compound primary key - so we must check all of the non-key
attributes against each part of the key to ensure they are functionally dependent on it.
• matric_no determines name and date_of_birth, but not grade.
• subject together with matric_no determines grade, but not name or date_of_birth.
• So there is a problem with potential redundancies

A dependency diagram is used to show how non-key attributes relate to each part or combination
of parts in the primary key.

Figure : Dependency Diagram

• This relation is not in 2NF


• It appears to be two tables squashed into one.
• the solution is to split the relation up into its component parts.
• separate out all the attributes that are solely dependent on matric_no
• put them in a new Student_details relation, with matric_no as the primary key
• separate out all the attributes that are solely dependent on subject.
• in this case no attributes are solely dependent on subject.
• separate out all the attributes that are solely dependent on matric_no + subject
• put them into a separate Student relation, keyed on matric_no + subject

All attributes in each relation are fully
functionally dependent upon its primary key

These relations are now in 2NF

Figure : Dependencies after splitting


Interestingly this is the same set of relations as when we recognized that there were repeating
terms in the table and directly removed the repeating terms. It should not really matter what
process you followed when normalizing, as the end result should be similar relations.

4.6.2 Third Normal Form


3NF is an even stricter normal form and removes virtually all the redundant data:

• A relation is in 3NF if, and only if, it is in 2NF and there are no transitive functional
dependencies
• Transitive functional dependencies arise:
• when one non-key attribute is functionally dependent on another non-key attribute:
• FD: non-key attribute -> non-key attribute
• and when there is redundancy in the database

By definition transitive functional dependency can only occur if there is more than one non-key
field, so we can say that a relation in 2NF with zero or one non-key field must automatically be
in 3NF.

Project has more than one non-key field, so we must check for transitive dependency:

project_no manager address
p1 Black,B 32 High Street
p2 Smith,J 11 New Street
p3 Black,B 32 High Street
p4 Black,B 32 High Street

• address depends on the value in the manager column


• every time B Black is listed in the manager column, the address column has the value `32
High Street'. From this the relation and functional dependency can be implied as:

Project(project_no, manager, address)

manager -> address


• in this case address is transitively dependent on manager. Manager is the determinant - it
determines the value of address. It is a transitive functional dependency only if all
attributes on the left of the "->" are not in the key but are all in the relation, and all
attributes to the right of the "->" are not in the key, with at least one actually being in the
relation.
• Data redundancy arises from this
• we duplicate address if a manager is in charge of more than one project
• this causes problems if we had to change the address - we would have to change several entries,
and this could lead to errors.
• The solution is to eliminate transitive functional dependency by splitting the table
• create two relations - one with the transitive dependency in it, and another for all of the
remaining attributes.
• split Project into Project and Manager.
• the determinant attribute becomes the primary key in the new relation
• manager becomes the primary key to the Manager relation
• the original key is the primary key to the remaining non-transitive attributes
• in this case, project_no remains the key to the new Projects table.

Project
project_no manager
p1 Black,B
p2 Smith,J
p3 Black,B
p4 Black,B
Manager
manager address
Black,B 32 High Street
Smith,J 11 New Street

• Now we need to store the address only once


• If we need to know a manager's address we can look it up in the Manager relation
• The manager attribute is the link between the two tables, and in the Projects table it is
now a foreign key.
• These relations are now in third normal form.
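
If the two relations were implemented as SQL tables with these names (an illustrative sketch only), a manager's address could be looked up with a join, so it needs to be stored only once:

SELECT p.project_no, p.manager, m.address
FROM Project p
JOIN Manager m ON m.manager = p.manager;   -- reconstructs the original view without duplicating address in Project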

Summary: 1NF

• A relation is in 1NF if it contains no repeating groups


• To convert an unnormalised relation to 1NF either:
• Flatten the table and change the primary key, or
• Decompose the relation into smaller relations, one for the repeating groups and one for
the non-repeating groups.
• Remember to put the primary key from the original relation into both new relations.
• This option is liable to give the best results.

Summary: 2NF

• A relation is in 2NF if it contains no repeating groups and no partial key functional
dependencies
• Rule: A relation in 1NF with a single key field must be in 2NF
• To convert a relation with partial functional dependencies to 2NF, create a set of new
relations:
• One relation for the attributes that are fully dependent upon the key.
• One relation for each part of the key that has partially dependent attributes

Summary: 3NF

• A relation is in 3NF if it contains no repeating groups, no partial functional dependencies,
and no transitive functional dependencies
• To convert a relation with transitive functional dependencies to 3NF, remove the
attributes involved in the transitive dependency and put them in a new relation
• Rule: A relation in 2NF with only one non-key attribute must be in 3NF
• In a normalised relation a non-key field must provide a fact about the key, the whole key
and nothing but the key.
• Relations in 3NF are sufficient for most practical database design problems. However,
3NF does not guarantee that all anomalies have been removed.

4.6.3 Normalisation - BCNF

Overview

• normalise a relation to Boyce Codd Normal Form (BCNF)


• Normalisation example

Boyce-Codd Normal Form (BCNF)

• When a relation has more than one candidate key, anomalies may result even though the
relation is in 3NF.
• 3NF does not deal satisfactorily with the case of a relation with overlapping candidate
keys
• i.e. composite candidate keys with at least one attribute in common.
• BCNF is based on the concept of a determinant.
• A determinant is any attribute (simple or composite) on which some other attribute is
fully functionally dependent.
• A relation is in BCNF if, and only if, every determinant is a candidate key.

Consider the following relation and determinants.


R(a,b,c,d)
a,c -> b,d
a,d -> b

Here, the first determinant suggests that the primary key of R could be changed from a,b to a,c. If
this change were made, all of the non-key attributes present in R could still be determined, and
therefore this change is legal. However, the second determinant indicates that a,d determines b,
but a,d could not be the key of R as a,d does not determine all of the non-key attributes of R (it
does not determine c). We would say that the first determinant is a candidate key, but the second
determinant is not a candidate key, and thus this relation is not in BCNF (but is in 3rd normal
form).
Normalisation to BCNF - Example 1

Patient No Patient Name Appointment Id Time Doctor


1 John 0 09:00 Zorro
2 Kerr 0 09:00 Killer
3 Adam 1 10:00 Zorro
4 Robert 0 13:00 Killer
5 Zane 1 14:00 Zorro
Let's consider the database extract shown above. This depicts a special dieting clinic where
each patient has four appointments. On the first they are weighed, on the second they are exercised, on the
third their fat is removed by surgery, and on the fourth their mouth is stitched closed… Not all
patients need all four appointments! If the Patient Name begins with a letter before “P” they get a
morning appointment, otherwise they get an afternoon appointment. Appointment 1 is either
09:00 or 13:00, appointment 2 10:00 or 14:00, and so on. From this (hopefully) make-believe
scenario we can extract the following determinants:

DB(Patno,PatName,appNo,time,doctor)
Patno -> PatName
Patno,appNo -> Time,doctor
Time -> appNo
Now we have to decide what the primary key of DB is going to be. From the information we
have, we could choose either Patno plus appNo, or Patno plus time, as the key:
DB(Patno,PatName,appNo,time,doctor) - primary key Patno,appNo (example 1a)
or
DB(Patno,PatName,appNo,time,doctor) - primary key Patno,time (example 1b)
Example 1a - DB(Patno,PatName,appNo,time,doctor), primary key Patno,appNo
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)

• 3NF Eliminate transitive dependencies

None: so just as 2NF

• BCNF Every determinant is a candidate key
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
• Go through all determinants where ALL of the left-hand attributes are present in a
relation and at least ONE of the right-hand attributes is also present in the relation.
• Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
• Patno,appNo -> Time,doctor
All LHS present, and time and doctor also present, so relevant. Is this a candidate key?
Patno,appNo IS the key, so this is a candidate key. Thus this is OK for BCNF
compliance.
• Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? If it were, then we
could rewrite DB with time as its key:
DB(Patno,appNo,time,doctor)
This will not work, as you need both time and Patno together to form a unique key. Thus
this determinant is not a candidate key, and therefore DB is not in BCNF. We need to fix
this.
• BCNF: rewrite to
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

time is enough to work out the appointment number of a patient. Now BCNF is satisfied, and the
final relations shown are in BCNF.
Example 1b - DB(Patno,PatName,appNo,time,doctor), primary key Patno,time
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)

• 2NF Eliminate partial key dependencies

DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

• 3NF Eliminate transitive dependencies

None: so just as 2NF

• BCNF Every determinant is a candidate key


• DB(Patno,time,doctor)

• R1(Patno,PatName)
• R2(time,appNo)
• Go through all determinants where ALL of the left-hand attributes are present in a
relation and at least ONE of the right-hand attributes is also present in the relation.
• Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
• Patno,appNo -> Time,doctor
Not all LHS present, so not relevant.
• Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? Time is
already the key of R2, so it is a candidate key and the rule for BCNF is satisfied.
• BCNF: as 3NF
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

Summary - Example 1
This example has demonstrated three things:

• BCNF is stronger than 3NF; relations that are in 3NF are not necessarily in BCNF
• BCNF is needed in certain situations to obtain full understanding of the data model
• there are several routes to take to arrive at the same set of relations in BCNF.
• Unfortunately there are no rules as to which route will be the easiest one to take.

Example 2
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstrucLocn,Grade)))

• Functional dependencies

StudNo -> StudName


CourseNo -> Ctitle,InstrucName
InstrucName -> InstrucLocn
StudNo,CourseNo,Major -> Grade
StudNo,Major -> Advisor
Advisor -> Major

• Unnormalised

Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstrucLocn,Grade)))

• 1NF Remove repeating groups

Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)

StudCourse(StudNo,Major,CourseNo,
Ctitle,InstrucName,InstrucLocn,Grade)
• 2NF Remove partial key dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName,InstrucLocn)
• 3NF Remove transitive dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstrucLocn)

• BCNF Every determinant is a candidate key


• Student : only determinant is StudNo
• StudCourse: only determinant is StudNo,Major,CourseNo
• Course: only determinant is CourseNo
• Instructor: only determinant is InstrucName
• StudMajor: the determinants are
• StudNo,Major, or
• Advisor

Only StudNo,Major is a candidate key.

BCNF

Student(StudNo,StudName)

StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstrucLocn)
StudMajor(StudNo,Advisor)
Advisor(Advisor,Major)
Problems BCNF overcomes
STUDENT MAJOR ADVISOR
123 PHYSICS EINSTEIN
123 MUSIC MOZART
456 BIOLOGY DARWIN
789 PHYSICS BOHR
999 PHYSICS EINSTEIN

• If the record for student 456 is deleted we lose not only information on student 456 but
also the fact that DARWIN advises in BIOLOGY

• we cannot record the fact that WATSON can advise on COMPUTING until we have a
student majoring in COMPUTING to whom we can assign WATSON as an advisor.

In BCNF we have two tables:


STUDENT ADVISOR
123 EINSTEIN
123 MOZART
456 DARWIN
789 BOHR
999 EINSTEIN
ADVISOR MAJOR
EINSTEIN PHYSICS
MOZART MUSIC
DARWIN BIOLOGY
BOHR PHYSICS

Returning to the ER Model

• Now that we have reached the end of the normalisation process, you must go back and
compare the resulting relations with the original ER model
• You may need to alter it to take account of the changes that have occurred during the
normalisation process. Your ER diagram should always be a perfect reflection of the
model you are going to implement in the database, so keep it up to date!
• The changes required depend on how good the ER model was at first!

Normalisation Example
Library
Consider the case of a simple video library. Each video has a title, director, and serial number.
Customers have a name, address, and membership number. Assume only one copy of each video
exists in the library. We are given:
video(title,director,serial)
customer(name,addr,memberno)
hire(memberno,serial,date)

title->director,serial
serial->title
serial->director
name,addr -> memberno
memberno -> name,addr
serial,date -> memberno
What normal form is this?

• No repeating groups, so at least 1NF
• 2NF? There is a composite key in hire. Investigate further... Can memberno in hire be
found with just serial or just date? No. Therefore the relations are in at least 2NF.
• 3NF? serial->director is a non-key dependency. Therefore the relations are currently in
2NF.

Convert from 2NF to 3NF.


Rewrite
video(title,director,serial)
To
video(title,serial)
serial(serial,director)

Therefore the new relations become:


video(title,serial)
serial(serial,director)
customer(name,addr,memberno)
hire(memberno,serial,date)
In BCNF? Check if every determinant is a candidate key.
video(title,serial)
Determinants are:
title->director,serial Candidate key
serial->title Candidate key
video in BCNF

serial(serial,director)
Determinants are:
serial->director Candidate key
serial in BCNF

customer(name,addr,memberno)
Determinants are:
name,addr -> memberno Candidate key
memberno -> name,addr Candidate key
customer in BCNF

hire(memberno,serial,date)
Determinants are:
serial,date -> memberno Candidate key
hire in BCNF
Therefore the relations are also now in BCNF.
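
As a sketch only, the final BCNF relations could be declared in SQL along these lines; the data types are assumptions, and names such as serial and date may need quoting in some systems:

CREATE TABLE video (
    title  VARCHAR(60) UNIQUE,          -- title -> serial, so title is also a candidate key
    serial INTEGER PRIMARY KEY
);

CREATE TABLE serial (
    serial   INTEGER PRIMARY KEY REFERENCES video(serial),
    director VARCHAR(60)
);

CREATE TABLE customer (
    memberno INTEGER PRIMARY KEY,
    name     VARCHAR(60),
    addr     VARCHAR(80),
    UNIQUE (name, addr)                 -- name,addr is the other candidate key
);

CREATE TABLE hire (
    memberno INTEGER REFERENCES customer(memberno),
    serial   INTEGER REFERENCES video(serial),
    date     DATE,
    PRIMARY KEY (serial, date)          -- serial,date -> memberno
);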

5 Relational Algebra

In order to implement a DBMS, there must exist a set of rules which state how the database
system will behave. For instance, somewhere in the DBMS must be a set of statements which
indicate that when someone inserts data into a row of a relation, it has the effect which the user
expects. One way to specify this is to use words to write an `essay' as to how the DBMS will
operate, but words tend to be imprecise and open to interpretation. Instead, relational databases
are more usually defined using Relational Algebra.
The relational algebra is a procedural query language that consists of a set of operations that take
one or two relations as input and produce a new relation as their result. Each query describes a
step-by-step procedure for computing the desired answer. Operators may be unary (operating on
one relation) or binary (operating on two relations).

Relational Algebra is:

• the formal description of how a relational database operates


• an interface to the data stored in the database itself
• the mathematics which underpin SQL operations

Operators in relational algebra are not necessarily the same as SQL operators, even if they have
the same name. For example, the SELECT statement exists in SQL, and also exists in relational
algebra. These two uses of SELECT are not the same. The DBMS must take whatever SQL
statements the user types in and translate them into relational algebra operations before applying
them to the database.
Terminology

• Relation - a set of tuples.


• Tuple - a collection of attributes which describe some real world entity.
• Attribute - a real world role played by a named domain.
• Domain - a set of atomic values.
• Set - a mathematical definition for a collection of objects which contains no duplicates.

The Relational Algebra - Fundamental Operations
•Unary:
–Selection
–Projection
–Renaming
•Binary:
–Union
–Set difference
–Cartesian product (Cross product)

Operators - Write

• INSERT - provides a list of attribute values for a new tuple in a relation. This operator is
the same as SQL.
• DELETE - provides a condition on the attributes of a relation to determine which tuple(s)
to remove from the relation. This operator is the same as SQL.
• MODIFY - changes the values of one or more attributes in one or more tuples of a
relation, as identified by a condition operating on the attributes of the relation. This is
equivalent to SQL UPDATE.
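
For comparison, the corresponding SQL statements might look as follows; the employee table and its columns used here are illustrative assumptions, not part of the notes:

INSERT INTO employee (empno, surname, depno) VALUES (1001, 'Smith', 1);  -- relational INSERT
DELETE FROM employee WHERE empno = 1001;                                 -- relational DELETE
UPDATE employee SET depno = 2 WHERE surname = 'Smith';                   -- relational MODIFY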

Operators - Retrieval
There are two groups of operations:

• Mathematical set theory based relations:


UNION, INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT.
• Special database operations:
SELECT (not the same as SQL SELECT), PROJECT, and JOIN.

Relational SELECT

SELECT is used to obtain a subset of the tuples of a relation that satisfy a select condition.
For example, find all employees born after 1st Jan 1950:
SELECT[dob > '01/JAN/1950'](employee)
Relational PROJECT
The PROJECT operation is used to select a subset of the attributes of a relation by specifying the
names of the required attributes.
For example, to get a list of all employees' surnames and employee numbers:
PROJECT[surname,empno](employee)

SELECT and PROJECT


SELECT and PROJECT can be combined together. For example, to get a list of employee
numbers for employees in department number 1:

Figure : Mapping select and project
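
In SQL terms these operations correspond roughly to the following (a sketch only; the depno column and the date literal format are assumptions):

SELECT * FROM employee WHERE dob > '01-JAN-1950';   -- relational SELECT (restriction)
SELECT surname, empno FROM employee;                -- relational PROJECT
SELECT empno FROM employee WHERE depno = 1;         -- SELECT and PROJECT combined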


5.1 Set Operations - semantics
Consider two relations R and S.

• UNION of R and S
the union of two relations is a relation that includes all the tuples that are either in R or in
S or in both R and S. Duplicate tuples are eliminated.

• INTERSECTION of R and S
the intersection of R and S is a relation that includes all tuples that are both in R and S.
• DIFFERENCE of R and S
the difference of R and S is the relation that contains all the tuples that are in R but that
are not in S.

SET Operations - requirements


For set operations to function correctly the relations R and S must be union compatible. Two
relations are union compatible if

• they have the same number of attributes


• the domain of each attribute in column order is the same in both R and S.

UNION Example

Figure : UNION
INTERSECTION Example

Figure : Intersection

DIFFERENCE Example

Figure : DIFFERENCE
CARTESIAN PRODUCT
The Cartesian Product is also an operator which works on two sets. It is sometimes called the
CROSS PRODUCT or CROSS JOIN.
It combines each tuple of one relation with every tuple of the other relation.

CARTESIAN PRODUCT example

Figure : CARTESIAN PRODUCT


JOIN Operator
JOIN is used to combine related tuples from two relations:

• In its simplest form the JOIN operator is just the cross product of the two relations.
• As the join becomes more complex, tuples are removed within the cross product to make
the result of the join more meaningful.
• JOIN allows you to evaluate a join condition between the attributes of the relations on
which the join is undertaken.

The notation used is


R JOIN[join condition] S

JOIN Example

Figure : JOIN
Natural Join
Invariably the JOIN involves an equality test, and thus is often described as an equi-join. Such
joins result in two attributes in the resulting relation having exactly the same value. A `natural
join' will remove the duplicate attribute(s).

• In most systems a natural join will require that the attributes have the same name to
identify the attribute(s) to be used in the join. This may require a renaming mechanism.
• If you do use natural joins make sure that the relations do not have two attributes with the
same name by accident.

OUTER JOINs
Notice that tuples which do not satisfy the join condition are lost when applying a join to two
relations. In some cases this lost data might hold useful information. An outer join retains the
information that would have been lost from the tables, replacing missing data with nulls.
There are three forms of the outer join, depending on which data is to be kept.

• LEFT OUTER JOIN - keep data from the left-hand table


• RIGHT OUTER JOIN - keep data from the right-hand table
• FULL OUTER JOIN - keep data from both tables
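
A sketch of the three forms in SQL, assuming two illustrative tables employee(empno, depno) and department(depno, dname); note that FULL OUTER JOIN is not supported by every DBMS:

SELECT e.empno, d.dname
FROM employee e LEFT OUTER JOIN department d ON e.depno = d.depno;   -- all employees kept, dname may be NULL

SELECT e.empno, d.dname
FROM employee e RIGHT OUTER JOIN department d ON e.depno = d.depno;  -- all departments kept

SELECT e.empno, d.dname
FROM employee e FULL OUTER JOIN department d ON e.depno = d.depno;   -- rows from both sides kept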

OUTER JOIN example 1

Figure : OUTER JOIN (left/right)
OUTER JOIN example 2

Figure : OUTER JOIN (full)

6 Concurrency using Transactions
The goal in a `concurrent' DBMS is to allow multiple users to access the database simultaneously
without interfering with each other.

A problem with multiple users using the DBMS is that it may be possible for two users to try and
change data in the database simultaneously. If this type of action is not carefully controlled,
inconsistencies are possible.

To control data access, we first need a concept to allow us to encapsulate database accesses.
Such encapsulation is called a `Transaction'.

6.1 Transactions
Introduction
We can classify a database system according to the number of users who can use the system
concurrently.
• Single user
• Multi-user

Multiple users can access the database and use the computer simultaneously because of the
concept of multiprogramming, which allows a computer to execute many programs or processes.
Concurrent execution of processes is actually interleaved.

• A transaction is a logical unit of database processing that includes one or
more database operations. Transactions must have the ACID properties:
• Unit of logical work and recovery
o A - atomicity (for integrity)
o C - consistency preservation
o I - isolation
o D - durability
• Available in SQL
• The database operations that form a transaction can either be embedded within an
application program or can be specified interactively via a high-level query language
such as SQL.

• Some applications require nested or long transactions

What makes up a transaction?

o Beginning.
o Sequence of read and write operations.
o Ending - either committed or aborted.

After work is performed in a transaction, two outcomes are possible:

• Commit - Any changes made during the transaction by this transaction are committed to
the database.
• Abort - All the changes made during the transaction by this transaction are not made to
the database. The result of this is as if the transaction was never started.
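
As an illustrative SQL sketch (the account table is an assumption, and the statement that starts a transaction varies between systems):

START TRANSACTION;                                              -- BEGIN in some systems
UPDATE account SET balance = balance - 100 WHERE accno = 1;
UPDATE account SET balance = balance + 100 WHERE accno = 2;
COMMIT;                                                         -- make both changes permanent
-- ROLLBACK; would instead abort, leaving the database as if the transaction had never started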

6.1.1 Properties of transactions that a DBMS must maintain

• Transactions have ACID semantics.


o Atomicity - all or nothing; that is, either every action is carried out or none is.
o Consistency - transaction takes database from one consistent state to another.
o Isolation - concurrent transactions don't interfere with each other.
o Durability - Once committed, changes are permanent.

• Concurrency control deals with consistency and isolation (more later...)


• Recovery deals with atomicity, durability, and consistency (more later...).
• A separate "log file" is used for concurrency control and recovery.
o Each operation is written to the log.
o At some point the log is written to stable-storage (disk).
o Recovery uses the log to reconstruct the database if necessary.

6.1.2 Transaction Schedules


A schedule is a sequential ordering of operations. It can also be considered as a list of actions
(reading, writing, aborting or committing) from a set of transactions, and the order in which two
actions of a transaction T appear in a schedule must be the same as the order in which they
appear in T.
• Updates with concurrent access are the complicating factor.
• Each transaction has a serial schedule of its operations.
• A global schedule might be an interleaving of concurrent transaction schedules.
• The goal is a global schedule that preserves the consistency of the database.

A transaction schedule is a tabular representation of how a set of transactions were executed over
time. This is useful when examining problem scenarios. Within the diagrams various
nomenclatures are used:

• READ(a) - This is a read action on an attribute or data item called `a'.


• WRITE(x,a) - This is a write action on an attribute or data item called `a', where the value
`x' is written into `a'.
• tn (e.g. t1,t2,t10) - This indicates the time at which something occurred. The units are not
important, but tn always occurs before tn+1.

Consider transaction A, which loads in a bank account balance X (initially 20) and adds 10
pounds to it. Such a schedule would look like this:
Time Transaction A
t1 TOTAL:=READ(X)
t2 TOTAL:=TOTAL+10
t3 WRITE(TOTAL,X)
Now consider that, at the same time as transaction A runs, transaction B runs. Transaction B
gives all accounts a 10% increase. Will X be 32 or 33?
Time Transaction A Transaction B TOTAL BALANCE
t1 BALANCE:=READ(X) 20
t2 TOTAL:=READ(X) 20
t3 TOTAL:=TOTAL+10 30
t4 WRITE(TOTAL,X) 30
t5 BALANCE:=BALANCE*110% 22
t6 WRITE(BALANCE,X) 22
Whoops... X is 22! Depending on the interleaving, X can also be 32, 33, or 30. Let's classify the
erroneous scenarios.

6.2 Lost Update scenario.


Time Transaction A Transaction B
t1 X=READ(R)
t2 Y=READ(R)
t3 WRITE(X,R)
t4 WRITE(Y,R)
Transaction A's update is lost at t4, because Transaction B overwrites it. B missed A's update at
t3 as it got the value of R at t2.

6.3 Uncommitted Dependency


Time Transaction A Transaction B
t1 WRITE(X,R)
t2 Y=READ(R)
t3 ABORT
Transaction A is allowed to READ (or WRITE) item R which has been updated by another
transaction but not committed (and in this case ABORTed).

Inconsistency

Time X Y Z Transaction A Transaction B SUM
t1 40 50 30 SUM:=READ(X) 40
t2 40 50 30 SUM+=READ(Y) 90
t3 40 50 30 ACC1=READ(Z)
t4 40 50 20 WRITE(ACC1-10,Z)
t5 40 50 20 ACC2=READ(X)
t6 50 50 20 WRITE(ACC2+10,X)
t7 50 50 20 COMMIT
t8 50 50 20 SUM+=READ(Z) 110
SUM should have been 120...

6.3.1 Serialisability

• A `schedule' is the actual execution sequence of two or more concurrent transactions.


• A schedule of two transactions T1 and T2 is `serialisable' if and only if executing this
schedule has the same effect as either T1;T2 or T2;T1.

We noted earlier that a serial schedule preserves consistency.

• Allow non-serial schedules that can be proven to be equivalent to a serial schedule of


committed transactions.
• "Conflict Serializable" schedule - consistent if we can reorder non-conflicting operations
to get a serial schedule

Precedence Graph
In order to know whether a particular transaction schedule can be serialized, we can draw a
precedence graph. This is a graph of nodes and edges, where the nodes are the transaction
names and the edges represent attribute collisions.
The schedule is serialisable if and only if there are no cycles in the resulting graph.
Precedence Graph : Method
To draw one;

• Draw a node for each transaction in the schedule


• Where transaction A writes to an attribute which transaction B has read from, draw a
line pointing from B to A.
• Where transaction A writes to an attribute which transaction B has written to, draw a line
pointing from B to A.
• Where transaction A reads from an attribute which transaction B has written to, draw a
line pointing from B to A.

Example 1
Consider the following schedule:

Time TRAN1 TRAN2
t1 READ(A)
t2 READ(B)
t3 READ(A)
t4 READ(B)
t5 WRITE(x,B)
t6 WRITE(y,B)

Example 2
Consider the following schedule:

Time TRAN1 TRAN2 TRAN3


t1 READ(A)
t2 READ(B)
t3 READ(A)
t4 READ(B)
t5 WRITE(x,A)
t6 WRITE(v,C)
t7 WRITE(w,B)
t8 WRITE(z,C)

• Serial Schedules - trivially consistent


– Transactions run sequentially - T1, T2, ... Tn
– No interleaving of transaction operations, so no true concurrent execution.
– Transactions can be reordered - T5, T1, T3, ...
– Trivially preserves database consistency since each transaction preserves
consistency.

6.3.2 Concurrency Locking

Many systems use locking mechanisms for concurrency control. When a transaction needs an
assurance that some object will not change in some unpredictable manner, it acquires a lock on
that object.

• A transaction holding a read lock is permitted to read an object but not to change it.
• More than one transaction can hold a read lock for the same object.
• Usually, only one transaction may hold a write lock on an object.
• On a transaction schedule, we use `S' to indicate a shared lock, and `X' for an exclusive
write lock.
• A DBMS must be able to ensure that only serializable, recoverable schedules are
allowed.
• No action of committed transaction are lost while undoing aborted transactions.
• A DBMS uses locking protocol to achieve this.
• A locking protocol is a set of rules to be followed by each transaction (and enforced by
the DBMS), in order to ensure that even though the actions of several transactions might be
interleaved, the net effect is identical to executing all transactions in some serial order.

Locking is used to prevent concurrency conflicts. A lock is a binary variable - locked/unlocked
- similar to an OS semaphore (railroad train control). Shared locks can be held on a data item
D by multiple transactions simultaneously; they are also called "read" locks and can be upgraded to
an exclusive lock.

Strict Two-phase Locking ( Strict 2PL)

•Most widely used locking protocol.


•Has the following rules:
–If a transaction T wants to read (or modify) an object it first requests a shared (or exclusive)
lock on the object.
–All locks held by a transaction are released when the transaction is completed.
•Allows only serializable schedules.
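
Many systems expose shared and exclusive locks through SQL. The sketch below uses MySQL/InnoDB-style syntax and an assumed account table; names and exact syntax vary between systems (newer MySQL versions, for example, use FOR SHARE):

START TRANSACTION;
SELECT balance FROM account WHERE accno = 1 LOCK IN SHARE MODE;  -- shared (read) lock on the row
SELECT balance FROM account WHERE accno = 2 FOR UPDATE;          -- exclusive (write) lock on the row
-- under strict 2PL both locks are held until the transaction ends
COMMIT;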

Locking - Uncommitted Dependency

Locking solves the uncommitted dependency problem.


Time Transaction A Transaction B Lock on R
t1 WRITE(p,R) -=X
t2 READ R (WAIT) X
t3 ...wait... ABORT X=-
t4 READ R (CONT) -=S

6.3.3 Deadlock
Deadlock can arise when locks are used, and causes all related transactions to WAIT forever...
time Transaction A Transaction B Lock State
X Y
t1 WRITE(p,X) -=X -

t2 WRITE(q,Y) X -=X
t3 READ(Y) (WAIT) X X
t3 ...WAIT... READ(X) (WAIT) X X
t3 ...WAIT... ...WAIT... X X
The `lost update' scenario results in deadlock when locks are used. So does the `inconsistency' scenario.

time Transaction A Transaction B


Lock R
t1 READ(R) -=S
t2 READ(R) S=S
t3 WRITE(p,R) (WAIT) S
t3 ...wait... WRITE(q,R) (WAIT) S
t3 ...wait... ...wait... S

Deadlock arises when:


1. A transaction is holding an exclusive lock that some other transaction is waiting for.
2. The wait is circular, that is Ti is waiting (directly or indirectly) on Tj and vice versa.
• Two alternatives:
1. Prevent deadlock from happening.
2. Detect deadlock when it happens

6.3.4 Deadlock Handling

• Deadlock avoidance
o pre-claim strategy used in operating systems
o not effective in database environments.

Force transactions to acquire all locks at the beginning of the transaction.

o Difficult to be omniscient.
o Reduces concurrency.
o Some transactions might never start - livelock (starvation).
• Preempt transactions that might lead to a deadlock. When lock conflicts arise:
• Wait die: only older transactions wait for younger transactions to finish. Young
transactions waiting on older transactions are killed and later restarted.
• Wound wait: young transactions wait for old transactions to finish. If an older transaction
requests a lock held by a younger one, the younger transaction is killed (wounded) and later restarted.
• Using these rules, some transaction will always be making progress.
• No deadlock possible since we kill any transaction that might cause deadlock.

• Deadlock detection
o performed whenever a lock request has to wait, or on some periodic basis.

o if a transaction is blocked due to another transaction, make sure that that
transaction is not blocked on the first transaction, either directly or indirectly via
another transaction.
o Periodically* check for deadlock - if it has occurred, break it.
o Construct a wait-for graph. Nodes are transactions, edges from Ti to Tj if Ti is
waiting on a lock held by Tj.
o Any cycles in graph indicate a deadlock.
o *Could trigger inspection by ‘watchdog’ protocol.

Deadlock Resolution
If a set of transactions is considered to be deadlocked:

• choose a victim (e.g. the shortest-lived transaction)


• rollback `victim' transaction and restart it.
• The rollback terminates the transaction, undoing all its updates and releasing all of its
locks.
• A message is passed to the victim and depending on the system the transaction may or
may not be started again automatically.

6.3.5 Two-Phase Locking


The presence of locks does not guarantee serialisability. If a transaction is allowed to release
locks before the transaction has completed, and is also allowed to acquire more (or even the
same) locks later, then the benefit of locking is lost.
If all transactions obey the `two-phase locking protocol', then all possible interleaved executions
are guaranteed serialisable.
The two-phase locking protocol:

• Before operating on any item, a transaction must acquire at least a shared lock on that
item. Thus no item can be accessed without first obtaining the correct lock.
• After releasing a lock, a transaction must never go on to acquire any more locks.

The technical names for the two phases of the locking protocol are the `lock-acquisition phase'
and the `lock-release phase'.
6.3.6 Other Database Consistency Methods
Two-phase locking is not the only approach to enforcing database consistency. Another method
used in some DBMSs is timestamping. With timestamping, there are no locks to prevent
transactions seeing uncommitted changes, and all physical updates are deferred to commit time.

• locking synchronises the interleaved execution of a set of transactions in such a way that
it is equivalent to some serial execution of those transactions.
• timestamping synchronises the interleaved execution in such a way that it is equivalent
to a particular serial order - the order of the timestamps.

Timestamping rules
The following rules are checked when transaction T attempts to change a data item. If the rule
indicates ABORT, then transaction T is rolled back and aborted (and perhaps restarted).

• If T attempts to read a data item which has already been written to by a younger
transaction then ABORT T.
• If T attempts to write a data item which has been seen or written to by a younger
transaction then ABORT T.

If transaction T aborts, then all other transactions which have seen a data item written to by T
must also abort. In addition, other aborting transactions can cause further aborts on other
transactions. This is a `cascading rollback'.

Why Concurrency Control is needed


Several problems can occur when concurrent transactions execute in an uncontrolled manner:
• The lost update problem: occurs when two transactions that access the same database items
have their operations interleaved in a way that makes the values of some database item
incorrect.
• The temporary update (or Dirty Read) problem: occurs when one transaction updates a
database item and then the transaction fails for some reason.
The updated item is accessed by another transaction before it is changed back to its original
value (dirty data).

•The incorrect summary problem: If one transaction is calculating an aggregate summary
function on a number of records while other transactions are updating some of these records, the
aggregate function may calculate some values before they are updated and others after they are
updated
•Unrepeatable read: a transaction reads an item twice and the item is changed by another
transaction between the two reads.
•The transaction receives different values for its two reads of the same item.

6.4 Crash Recovery


A database might be left in an inconsistent state when:

• deadlock has occurred.


• a transaction aborts after updating the database.
• software or hardware errors.
• incorrect updates have been applied to the database.

If the database is in an inconsistent state, it is necessary to recover to a consistent state. The basis
of recovery is to have backups of the data in the database.

• The recovery manager of a DBMS is responsible for ensuring transaction atomicity and
durability.
• Atomicity- by undoing the actions of transactions that do not commit.
• Durability: by making sure that all actions of committed transactions survive system
crash and media failures (e.g. a disk is corrupted)
• When a DBMS is restarted after a crash, the recovery manager is given control and must bring
the database to a consistent state.
• The recovery manager is also responsible for undoing the actions of an aborted transaction.
• The transaction manager of a DBMS controls execution of transactions.
• Atomic Writes: Writing a page to disk in an atomic action.

6.4.1 Why Recovery is Needed

• Whenever a transaction is submitted to a DBMS for execution, the system is responsible for
making sure that:
• All the operations in a transaction are completed successfully and their effect is recorded
permanently in the database.
• Or, the transaction has no effect whatsoever on the database or any other transaction.
• Neither may be achieved if the transaction fails after executing some of its operations but
before executing all of them.

Types of Failures

• A computer failure (System crash): Hardware, software or network error occurs during
transaction execution.
• A transaction or system error: Some operations in a transaction may cause it to fail e.g.
integer overflow or division by zero.
• Local errors or exception conditions detected by the transaction: conditions that occur during
execution that may require cancellation of the transaction, e.g. an insufficient account balance
may cause a transaction such as a withdrawal to be cancelled.
• Concurrency control enforcement: The CC method may decide to abort the transaction, to be
restarted later, because it violates serializability.
• Disk failure: due to a read or write malfunction or a read/write head crash. This may happen during a
read/write operation.
• Physical problems and catastrophes: include power or air conditioning failure, fire, theft,
sabotage, overwriting of disks, etc.

Recovery from a transaction failure means that the database is restored to the most recent
consistent state just before the time of failure. The system must keep information about the
changes that were applied to data items by the various transactions. This information is typically
stored in a system log

6.4.2 Typical strategy for recovery:

• If there is extensive damage, e.g. a catastrophic failure, the recovery method restores a past copy of
the database that was backed up to archival storage (typically tape) and reconstructs a
more current state by reapplying or redoing the operations of committed transactions
from the backed-up log.
• When the database is not physically damaged, the strategy is to reverse any changes that
caused the inconsistency by undoing some operations.

 Deferred update: Do not physically update the database on disk until after a transaction
reaches its commit point, then the updates are recorded in the database
 Immediate Update: The database may be updated by some operations before the
transaction reaches its commit point.

Recovery: the dump


The simplest backup technique is `the Dump'.

• entire contents of the database is backed up to an auxiliary store.


• must be performed when the state of the database is consistent - therefore no transactions
which modify the database can be running
• dumping can take a long time to perform
• you need to store the data in the database twice.
• as dumping is expensive, it probably cannot be performed as often as one would like.
• a cut-down version can be used to take `snapshots' of the most volatile areas.

Recovery: the transaction log


A technique often used to perform recovery is the transaction log or journal.

• records information about the progress of transactions in a log since the last consistent
state.
• the database therefore knows the state of the database before and after each transaction.
• every so often database is returned to a consistent state and the log may be truncated to
remove committed transactions.
• when the database is returned to a consistent state the process is often referred to as
`checkpointing'.

• If the disks are physically or logically damaged then recovery from the log is impossible
and instead a restore from a dump is needed.
• If the disks are OK then database consistency must be maintained. Writes to the disk
which were in progress at the time of the failure may have only been partially done.
• Parse the log file, and where a transaction has been ended with `COMMIT' apply the data
part of the log to the database.
• If a log entry for a transaction ends with anything other than COMMIT, do nothing for
that transaction.
• flush the data to the disk, and then truncate the log to zero.
• the process of reapplying transactions from the log is sometimes referred to as
`rollforward'.

6.4.3 Immediate Update

Immediate Database Modification

•Update database any time after log record written.


–If crash before commit, read the log and undo uncommitted transactions.
–If crash after commit, read the log and redo committed transactions.
•Benefits:
–Can flush modifications to disk anytime.
–Speeds up time-to-commit.
–If updates made before commit, no redo required during recovery.
•Drawbacks:
–Transaction abort is costlier.
–Slower
Immediate update, or UNDO/REDO, is another algorithm to support ABORT and machine
failure scenarios.

• While a transaction runs, changes made by that transaction can be written to the database
at any time. However, the original and the new data being written must both be stored in
the log BEFORE storing it on the database disk.
• On a commit:
• All the updates which have not yet been recorded on the disk are first stored in the log file,
which is then flushed to disk.
• The new data is then recorded in the database itself.
• On an abort, UNDO all the changes which that transaction has made to the database disk
using the `old data' log entries.
• On a system restart after a failure, REDO committed changes from log.

Example
Using immediate update, and the transaction TRAN1 again, the process is:
Time Action LOG
t1 START -
t2 read(A) -
t3 write(10,B) Was B == 6, now 10
t4 write(20,C) Was C == 2, now 20
t5 COMMIT COMMIT

DISK contents:
Before: A=5, B=6,  C=2
During: A=5, B=10, C=2
After:  A=5, B=10, C=20

If the DBMS fails and is restarted:

• If the disks are physically or logically damaged then recovery from the log is impossible
and instead a restore from a dump is needed.
• If the disks are OK then database consistency must be maintained. Writes to the disk
which were in progress at the time of the failure may have only been partially done.
• Parse the log file, and where a transaction has been ended with `COMMIT' apply the
`new data' part of the log to the database.
• If a log entry for a transaction ends with anything other than COMMIT, apply the `old
data' part of the log to the database.
• flush the data to the disk, and then truncate the log to zero.

6.4.4 Rollback
The process of undoing changes done to the disk under immediate update is frequently referred
to as rollback.

• Where the DBMS does not prevent one transaction from reading uncommitted
modifications (a `dirty read') of another transaction (i.e. the uncommitted dependency
problem) then aborting the first transaction also means aborting all the transactions which
have performed these dirty reads.
• as a transaction is aborted, it can therefore cause aborts in other dirty-reader transactions,
which in turn can cause further aborts in other dirty-reader transactions. This is referred to
as `cascading rollback'.

7 DBMS Implementation
7.1 Implementing a DBMS
A database management system handles the requests generated from the SQL interface,
producing or modifying data in response to these requests. This involves a multilevel processing
system.

Figure : DBMS Execution and Parsing


This level structure processes the SQL submitted by the user or application.

• Parser: The SQL must be parsed and tokenised. Syntax errors are reported back to the
user. Parsing can be time consuming, so good quality DBMS implementations cache
queries after they have been parsed so that if the same query is submitted again the
cached copy can be used instead. To make the best use of this most systems use
placeholders in queries, like:
• SELECT empno FROM employee where surname = ?

The '?' character is prompted for when the query is executed, and can be supplied
separately from the query by the API used to inject the SQL. The parameter is not part of
the parsing process, and thus once this query is parsed once it need not be parsed again.

• Executer: This takes the SQL tokens and translates them into relational algebra. Each
relational algebra fragment is optimised, and then passed down the levels to be acted
on.
• User: The concept of the user is required at this stage. This gives the query context, and
also allows security to be implemented on a per-user basis.
• Transactions: The queries are executed in the transaction model. The same query from
the same user can be executing multiple times in different transactions. Each transaction
is quite separate.
• Tables : The idea of the table structure is controlled at a low level. Much security is based
on the concept of tables, and the schema itself is stored in tables, as well as being a set of
tables itself.
• Table cache: Disks are slow, yet a disk is the best way of storing long-term data. Memory
is much faster, so it makes sense to keep as much table information as possible in
memory. The disk remains synchronised to memory as part of the transaction control
system.
• Disks : Underlying almost all database systems is the disk storage system. This provides
storage for the DBMS system tables, user information, schema definitions, and the user
data itself. It also provides the means for transaction logging.

The 'user' context is handled in a number of different ways, depending on the database system
being used. The following diagram gives you an idea of the approach followed by two different
systems, Oracle and MySQL.

Figure : Users and Tablespaces


All users in a system have login names and passwords. In Oracle, during the connection phase, a
database name must be provided. This allows one Oracle DBMS to run multiple databases, each
of which is effectively isolated from each other.
Once a user is connected using a username and password, MySQL places the user in a particular
tablespace in the database. The name of the tablespace is independent of the username. In Oracle,
tablespaces and usernames are synonymous, and thus you should really be thinking of different
usernames for databases that serve different purposes. In MySQL the philosophy is more like a
username is a person, and that person may want to do a variety of tasks.
Once in a tablespace, a number of tables are visible, and in each table columns are visible.

In both approaches, tables in other tablespaces can be accessed. MySQL effectively sees a
tablespace and a database being the same concept, but in Oracle the two ideas are kept slightly
more separate. However, the syntax remains the same. Just as you can access column owner of
table CAR, if it is in your own tablespace, by saying
SELECT car.owner FROM car;
You can access table CAR in another tablespace (let's call it vehicles) by saying:
SELECT vehicles.car.owner FROM vehicles.car;
The appearance of this structure is similar in concept to the idea of file directories. In a database
the directories are limited to "folder.table.column", although "folder" could be a username, a
tablename, or a database, depending on the philosophy of the database management system.
Even then, the concept is largely similar.
7.1.1 Disk and Memory
The tradeoff between the DBMS using Disk or using main memory should be understood...
Issue           Main Memory vs Disk
Speed           Main memory is at least 1000 times faster than disk
Storage Space   Disk can hold hundreds of times more information than memory for the same cost
Persistence     When the power is switched off, disk keeps the data; main memory forgets everything
Access Time     Main memory starts sending data in nanoseconds, while disk takes milliseconds
Block Size      Main memory can be accessed 1 word at a time, disk 1 block at a time
The DBMS runs in main memory, and the processor can only access data which is currently in
main memory. The handling of the differences between disk and main memory effectively is at
the heart of a good quality DBMS.
7.2 Disk Arrangements
Efficient processing of the DBMS requests requires efficient handling of disk storage. The
important aspects of this include:

• Index handling
• Transaction Log management
• Parallel Disk Requests
• Data prediction

With indexing, we are concerned with finding the data we actually want quickly and efficiently,
without having to request and read more disk blocks than absolutely necessary. There are many
approaches to this, but two of the more important ones are hash tables and binary trees.
When handling transaction logs, the discussion we have had so far has been on the theory of
these techniques. In practice, the separation of data and log is much more blurred. We will look
at one technique for implementing transaction logging, known as shadow paging.
Finally, the underlying desire of a good DBMS is to never be in a position where no further work
can be done until the disk gives us some data. Instead, by using prediction, prefetching, and
parallel disk operations, it is hoped that CPU time becomes the limiting factor.

7.2.1 Hash tables
A Hash table is one of the simplest index structures which a database can implement. The major
components of a hash index are the "hash function" and the "buckets". Effectively the DBMS
constructs an index for every table you create that has a PRIMARY KEY attribute, like:

CREATE TABLE test (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100)
);
In table test, we have decided to store 4 rows...
insert into test values (1,'Gordon');
insert into test values (2,'Jim');
insert into test values (4,'Andrew');
insert into test values (3,'John');

The algorithm splits the places in which the rows are to be stored into areas. These areas are called
buckets. If a row's primary key matches the requirements to be stored in that bucket, then that is
where it will be stored. The algorithm to decide which bucket to use is called the hash function.
For our example we will have a nice simple hash function, where the bucket number equals the
primary key. When the index is created we have to also decide how many buckets there are. In
this example we have decided on 4.

Figure : Hash Table with no collisions

Now we can find id 3 quickly and easily by visiting bucket 3 and looking into it. But now the
buckets are full. To add more values we will have to reuse buckets. We need a better hash
function based on mod 4. Now bucket 1 holds ids (1,5,9...), bucket 2 holds (2,6,10...), etc.

Figure : Hash Table with collisions

We have had to put more than 1 row in some of the buckets. This is called a hash collision. The
more collisions we have the longer the collision chain and the slower the system will get. For
instance, finding id 6 means visiting bucket 2, and then finding id 2, then 10, and then finally 6.
In DBMS systems we can usually ask for a hash index for a table, and also say how many
buckets we think we will need. This approach is good when we know how many rows there are
likely to be. Most systems will handle the hash table for you, adding more buckets over time if
you have made a mistake. It remains a popular indexing technique.
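
Where the DBMS supports it, a hash index can be requested explicitly. As an illustration, MySQL accepts USING HASH on storage engines that support hash indexes; the exact syntax is system-dependent:

CREATE INDEX test_id_hash ON test (id) USING HASH;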

7.3 Decision Support

•Definitions
•Data warehousing: Consolidate data from many sources in one large repository.
–Loading and periodic synchronization of replicas.
•OLAP (Online Analytical Processing): a term used for analysing complex data from the data warehouse.
–Complex SQL queries and views
–Queries based on spreadsheet-style operations and multidimensional view of data.
–Use distributed computing capabilities for analyses that require more storage and processing
power.
•Data Mining: Exploratory search for interesting trends and anomalies.
•Decision Support Systems (DSS) / Executive Information Systems (EIS): support an
organization’s leading decision makers with high-level data for complex and important
decisions.
•OLTP (Online Transaction Processing): supported by traditional databases and includes
insertions, updates, and deletions, while also supporting information query requirements.

7.3.1 Data Warehousing

•Data warehouse: A subject oriented, integrated, non-volatile, time-variant collection of data in


support of management decisions (Inmon, 1992).

•Contain consolidated data from many sources, augmented with summary information and
covering a long time period.
•Has a clear distinction from traditional databases, which are transactional (relational, object-
oriented, network or hierarchical).
•Data warehouses are mainly intended for decision support applications and are optimized for
data retrieval, not routine transaction processing.

Characteristics of Data Warehouses

•Multidimensional conceptual view - a collection of numeric measures, where each measure depends on a


set of dimensions.
•Generic dimensionality
•Unlimited dimensions and aggregation levels
•Unrestricted cross-dimensional operations
•Dynamic sparse matrix handling
•Client-server architecture
•Multi-user support
•Accessibility
•Transparency
•Intuitive data manipulation
•Consistent reporting performance
•Flexible reporting.

Order of Magnitudes

•Enterprise-wide data warehouses: huge projects requiring massive investment of time and
resources.
•Virtual data warehouses: provide views of operational databases that are materialized for
efficient access.
•Data marts: generally targeted for a subset of the organization such as department, and more
tightly focused.

7.3.2 Data Mining

Data mining consists of finding/discovering/mining interesting trends or patterns in large


datasets.

–Related to exploratory data analysis in statistics and knowledge discovery and machine learning
in artificial intelligence.
•Knowledge discovery in databases (KDD) encompasses more than data mining .
–It includes data selection, data cleansing, enrichment, data transformation or encoding,
data mining and display of the discovered information.

Goals of data mining and KDD

•Prediction: Showing how some attributes within the data will behave in future.
•Identification: Data patterns can be used to identify the existence of an item, an event or an
activity.
•Classification: Different classes or categories can be identified based on combination of
parameters.
•Optimization: Optimize the use of limited resources such as time, space and money to maximize
output variables.

7.3.3 Binary Tree


Binary trees are a newer approach to providing indexes. They are cleverer than hash tables, and
attempt to solve the problem of not knowing how many buckets you might need, and that some
collision chains might be much longer than others. The aim is to create indexes such that all rows
can be found in a similar number of steps through the storage blocks.
The state of the art in binary tree technology is the B+ Tree. With a B+ tree, the original data is
maintained in its creation order. This allows multiple B+ tree indices to be kept
for the same set of data records.

• the lowest level in the index has one entry for each data record.
• the index is created dynamically as data is added to the file.
• as data is added the index is expanded such that each record requires the same number of
index levels to reach it (thus the tree stays `balanced').
• the records can be accessed via an index or sequentially.

Each index node in a B+ Tree can hold a certain number of keys. The number of keys is often
referred to as the `order'. Unfortunately, `Order 2' and `Order 1' are frequently confused in the
database literature. For the purposes of our coursework and exam, `Order 2' means that there can

7.3.4 Index Structure and Access

• The top level of an index is usually held in memory. It is read once from disk at the start
of queries.
• Each index entry points to either another level of the index, a data record, or a block of
data records.
• The top level of the index is searched to find the range within which the desired record
lies.
• The appropriate part of the next level is read into memory from disk and searched.
• This continues until the required data is found.
• The use of indices reduces the amount of the file which has to be searched.

7.3.5 Costing Index and File Access

• The major cost of accessing an index is associated with reading in each of the
intermediate levels of the index from a disk (milliseconds).

• Searching the index once it is in memory is comparatively inexpensive (microseconds).
• The major cost of accessing data records involves waiting for the media to retrieve the
required blocks (milliseconds).
• Some indexes mix the index blocks with the data blocks, which means that disk accesses
can be saved because the final level of the index is read into memory with the associated
data records.
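
Most systems expose these costs indirectly through the query optimiser. As a hedged example
(EXPLAIN syntax as found in PostgreSQL and MySQL; the table, column and value are hypothetical),
the reported plan shows whether a query will use an index or scan the whole file:

-- Ask the optimiser how it intends to find the row; with an index on
-- emp_no it should report an index lookup rather than a full table scan.
EXPLAIN
SELECT name, salary
FROM   employee
WHERE  emp_no = 10234;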

7.3.6 Use of Indexes

• A DBMS may use different file organisations for its own purposes.
• A DBMS user is generally given little choice of file type.
• A B+ Tree is likely to be used wherever an index is needed.
• Indexes are generated:
o (Probably) for fields specified with `PRIMARY KEY' or `UNIQUE' constraints in
a CREATE TABLE statement.
o For fields specified in SQL statements such as CREATE [UNIQUE] INDEX
indexname ON tablename (col [,col]...);
• Primary Indexes have unique keys.
• Secondary Indexes may have duplicates.
• An index on a column which is used in an SQL `WHERE' predicate is likely to speed up
an enquiry.
• this is particularly so when `=' is involved (equijoin)
• no improvement will occur with `IS [NOT] NULL' statements
• an index is best used on a column with widely varying data.
• indexing a column of Y/N values might slow down enquiries.
• an index on telephone numbers might be very good but an index on area code might be a
poor performer.
• Multicolumn indexes can be used; the column which has the biggest range of values or
is the most frequently accessed should be listed first.
• Avoid indexing small relations, frequently updated columns, or those with long strings.
• There may be several indexes on each table. Note that partial indexing normally supports
only one index per table.
• Reading or updating a particular record should be fast.
• Inserting records should be reasonably fast. However, each index has to be updated too,
so increasing the indexes makes this slower.
• Deletion may be slow.
• particularly when indexes have to be updated.
• deletion may be fast if records are simply flagged as `deleted'.
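
Putting these guidelines together, a plausible set of index definitions for a hypothetical customer
table might look as follows (table, column and index names are illustrative, not taken from this
module):

CREATE TABLE customer (
    cust_no   INTEGER PRIMARY KEY,   -- usually indexed automatically
    surname   VARCHAR(30),
    forename  VARCHAR(30),
    phone     VARCHAR(15),
    area_code CHAR(4),               -- few distinct values: poor index candidate
    active    CHAR(1)                -- Y/N flag: indexing may slow enquiries
);

-- Widely varying values, often tested with `=' in WHERE clauses: good candidates.
CREATE UNIQUE INDEX idx_cust_phone ON customer (phone);

-- Multicolumn index: the most selective / most frequently queried column first.
CREATE INDEX idx_cust_name ON customer (surname, forename);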

7.3.7 Shadow Paging


The ideas proposed for implementing transactions are perfectly workable, but such an approach
would not likely be implemented in a modern system. Instead a disk-block technique such as
shadow paging would more likely be used. This saves much messing around with little pieces of
information, while maintaining disk order and disk clustering.
Disk clustering is when all the data which a query would want has been stored close together on
the disk. In this way, when a query is executed the DBMS can simply "scoop" up a few tracks
from the disk and have all the data it needs to complete the query. Without clustering, the disk
heads may have to move over the whole disk surface looking for bits of the query data, and this
could be hundreds of times slower than being able to get it all at once. Most DBMSs perform
clustering, either user-directed or automatically.
With shadow paging, transaction logs do not hold the attributes being changed but a copy of the
whole disk block holding the data being changed. This sounds expensive, but is actually highly
efficient. When a transaction begins, any change to a disk block follows this procedure:

1. If the disk block to be changed has been copied to the log already, jump to 3.
2. Copy the disk block to the transaction log.
3. Write the change to the original disk block.

On a commit the copy of the disk block in the log can be erased. On an abort all the blocks in the
log are copied back to their old locations. As disk access is based on disk blocks, this process is
fast and simple. Some DBMSs use a transaction mechanism based on shadow paging.

8 DATABASE SECURITY
This is the set of techniques used for protecting the database against access by unauthorised
persons and against malicious destruction or alteration, over and above the protection against
accidental introduction of inconsistency that integrity constraints provide.

The main issues to consider when designing a secure database are outlined below.

Types of Security
Database security addresses the following issues:
• Legal and ethical issues regarding the right to access certain information.
• Policy issues at the governmental, institutional or corporate levels as to what kind of
information should not be made publicly available.
• System-related issues, e.g. whether security should be handled at the physical or operating
system level.
• The need in some organizations to identify multiple security levels and to categorise
data and users according to these levels e.g. top secret, secret, confidential,
unclassified.

We can classify security mechanisms into:


• Discretionary security mechanisms: used to grant privileges to users, including the
capability to access specific data files in a specific mode (delete, modify, update); a
GRANT/REVOKE sketch follows this list.
• Mandatory security mechanisms: used to enforce multilevel security by classifying the
data and users into various security classes and then implementing the appropriate
security policy of the organization.
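
In SQL, discretionary security is exercised with GRANT and REVOKE. A minimal sketch follows;
the user names and the employee table are hypothetical:

-- Allow clerk1 to read and insert, but not to delete or alter rows.
GRANT SELECT, INSERT ON employee TO clerk1;

-- Allow a supervisor to update salaries and to pass that right on.
GRANT UPDATE (salary) ON employee TO supervisor1 WITH GRANT OPTION;

-- Withdraw a privilege that is no longer appropriate.
REVOKE INSERT ON employee FROM clerk1;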

Levels at Which to Take Security Measures


• Database system: some DB users may be authorised to access only a limited portion of
the database; others may be allowed to issue queries but be forbidden to modify data.
• Operating system: enforcing authorised access to the system itself.
• Network: software-level security within the network software is important both on the
Internet and in private networks.
• Physical: securing computer sites against armed or surreptitious entry by intruders.
• Human: reducing the chance of any user giving access to an intruder in exchange for a
bribe or other favours.
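
At the database-system level, restricting a user to a limited portion of the database is commonly
achieved by granting access to a view rather than to the base table. A hedged sketch (the staff
table and its column names are hypothetical):

-- Expose only non-sensitive columns of the staff table.
CREATE VIEW staff_directory AS
    SELECT staff_no, name, department, extension
    FROM   staff;

-- Users of the view can query it but never see salaries, and they hold
-- no privileges at all on the underlying table.
GRANT SELECT ON staff_directory TO clerk1;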
Database Security and the DBA

The DBA’s responsibilities include granting privileges to users and classifying them
according to the security policy of the organization.
The DBA has a DBA account in the DBMS, also called the system or superuser account,
which provides capabilities not available to other accounts.
The privileges include:
• Account creation
• Privilege granting
• Privilege revocation
• Security level assignment

Audit Trails

The DB system must also keep track of all operations on the database that are applied by
a certain user throughout each login session.
The system log includes an entry for each operation applied to the database, which may be
required for recovery from transaction failure or system crash.
A database audit consists of reviewing the log to examine all the accesses and operations
performed on the database during a certain time period.

An audit trail is a log of all changes (inserts, deletes, updates) to the database, along with
information such as which user performed each change and when it was performed.
The audit trail aids security in several ways: for example, if the balance of an account is found
to be incorrect, the bank can trace all the updates on the account to find the incorrect
(fraudulent) update.
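
In many systems the audit trail itself can be maintained with triggers. The sketch below assumes
PostgreSQL-style trigger syntax and a hypothetical account(account_no, balance) table; it records
the old and new balance, the user and the time of every balance update:

CREATE TABLE account_audit (
    account_no  INTEGER,
    old_balance NUMERIC(12,2),
    new_balance NUMERIC(12,2),
    changed_by  TEXT      DEFAULT current_user,
    changed_at  TIMESTAMP DEFAULT current_timestamp
);

CREATE FUNCTION log_balance_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO account_audit (account_no, old_balance, new_balance)
    VALUES (OLD.account_no, OLD.balance, NEW.balance);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_account_audit
AFTER UPDATE OF balance ON account
FOR EACH ROW EXECUTE FUNCTION log_balance_change();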

Encryption

A DBMS can use encryption to protect information in certain situations where the normal
security mechanisms of the DBMS are not adequate, e.g. when an intruder taps a
communication line.
The basic idea is to apply an encryption algorithm (which may be accessible to intruders)
to the original data together with a user-specified or DBA-specified encryption key, which
is kept secret.
There is also a decryption algorithm, which takes the encrypted data and the encryption
key as input and returns the original data.
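
Some DBMSs expose encryption and decryption functions directly in SQL. As one hedged example,
assuming PostgreSQL with the pgcrypto extension enabled (the payment table and the
:encryption_key bind parameter are illustrative only):

-- Store the card number encrypted with a key that is kept outside the database.
INSERT INTO payment (cust_no, card_no_enc)
VALUES (1001, pgp_sym_encrypt('4111111111111111', :encryption_key));

-- Only a session supplying the same key can recover the original data.
SELECT pgp_sym_decrypt(card_no_enc, :encryption_key) AS card_no
FROM   payment
WHERE  cust_no = 1001;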

Authentication

This is the task of verifying the identity of a person or piece of software connecting to a database.
The simplest form is a secret password which must be presented when connecting to the
database. This has some drawbacks, especially over a network, where an eavesdropper may
be able to sniff the data being sent across the network and so obtain the username and password.
A more secure scheme involves a challenge-response system: the system sends a challenge
string, and the user encrypts it using a secret password as the encryption key and returns the result.
The DB system verifies the authenticity of the user by decrypting the returned string with the
same secret password and checking the result against the original challenge string.
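
Password authentication of the simple kind described above is normally set up when the user
account is created. A minimal sketch, assuming PostgreSQL-style syntax and a hypothetical role
and database name (challenge-response and certificate schemes are configured outside SQL):

-- Create a login role protected by a password.
CREATE ROLE clerk1 LOGIN PASSWORD 'S3cret!';

-- Let the role connect to the bank database.
GRANT CONNECT ON DATABASE bankdb TO clerk1;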

