Kenya Methodist University Department of Mathematics and Computer Science
INSTRUCTION MATERIALS
Contents
2.4.8 Replacing ternary relationships
2.5 Cardinality
2.6 Optionality
2.7 Participation
2.8 Entity Sets
2.8.1 Confirming Correctness
2.8.2 Deriving the relationship parameters
2.9 Redundant relationships
2.10 Splitting n:m Relationships
2.11 Constructing an ER model
2.12 ER Examples
2.13 Problems with ER Models
2.13.1 Fan traps
2.14 Chasm traps
2.15 Enhanced ER Models (EER)
2.15.1 Key Constraints
2.15.2 Participation Constraints
2.15.3 Weak Entities
2.15.4 Aggregation
2.15.5 Class Hierarchies
2.15.6 Specialisation
2.15.7 Generalisation
3 Structured Query Language
3.1 Database Models
3.2 Relational Databases
3.3 Relational Data Structure
3.4 Domain and Integrity Constraints
3.5 Structure of a Table
3.5.1 Columns or Attributes
3.5.2 Basic Structure
3.5.3 Characteristics of Relations
3.6 Primary Keys
3.7 Integrity Constraints over Relations
3.7.1 Kinds of Constraints
3.8 SQL Basics
4 DATABASE DESIGN
4.1 Schema Refinement
4.1.1 Problems Caused by Redundancy
4.1.2 Use of Decomposition
4.1.3 Informal Design Guidelines for Relational Schemas
4.2 Functional Dependencies
4.3 Normalization
4.4 Integrity Constraints
4.5 Understanding Data
4.5.1 First Normal Form
4.6 Decomposing the relation
4.6.1 Second Normal Form
4.6.2 Third Normal Form
4.6.3 Normalisation - BCNF
5 Relational Algebra
5.1 Set Operations - semantics
6 Concurrency using Transactions
6.1 Transactions
6.1.1 Properties of transactions that a DBMS must maintain
6.1.2 Transaction Schedules
6.2 Lost Update scenario
6.3 Uncommitted Dependency
6.3.1 Serialisability
6.3.2 Concurrency Locking
6.3.3 Deadlock
6.3.4 Deadlock Handling
6.3.5 Two-Phase Locking
6.3.6 Other Database Consistency Methods
6.4 Crash Recovery
6.4.1 Why Recovery is Needed
6.4.2 Typical strategy for recovery
6.4.3 Immediate Update
6.4.4 Rollback
7 DBMS Implementation
7.1 Implementing a DBMS
7.1.1 Disk and Memory
7.2 Disk Arrangements
7.2.1 Hash tables
7.3 Decision Support
7.3.1 Data Warehousing
7.3.2 Data Mining
7.3.3 Binary Tree
7.3.4 Index Structure and Access
7.3.5 Costing Index and File Access
7.3.6 Use of Indexes
7.3.7 Shadow Paging
8 DATABASE SECURITY
1 Introduction to Database Systems
Databases and database management systems (DBMS) have become an essential component of
everyday life in modern society. Relational database systems have become increasingly popular
since the late 1970s. They offer a powerful method for storing data in an application-
independent manner. This means that for many enterprises the database is at the core of the I.T.
strategy. Developments can progress around a relatively stable database structure, which is
secure, reliable, efficient, and transparent.
In early systems, each suite of application programs had its own independent master file. The
duplication of data over master files could lead to inconsistent data. Efforts to use a common
master file for a number of application programs resulted in problems of integrity and security.
The production of new application programs could require amendments to existing application
programs, resulting in `unproductive maintenance'.
Data structuring techniques, developed to exploit random access storage devices, increased the
complexity of the insert, delete and update operations on data. As a first step towards a DBMS,
packages of subroutines were introduced to reduce programmer effort in maintaining these data
structures. However, the use of these packages still requires knowledge of the physical
organization of the data.
• In real-time and active database technology, databases are used to control industrial and
manufacturing processes.
• Data warehouses and online analytical processing (OLAP) systems are used in many
companies to extract and analyze information from large databases.
• Database search techniques are being applied to the World Wide Web to improve the
search for information that is needed by users browsing through the Internet.
To understand the fundamentals of database technology, we must start from the basics of
traditional database applications. The next section, therefore, defines what a database is,
together with definitions of other key terms.
1.3 The Database Approach
A database is a logically coherent collection of related data stored in a consistent form for a
specific purpose, such that information can be retrieved in an orderly, related, and meaningful
manner.
The contents of a database can hold a variety of different things. To make database design more
straightforward, database contents are divided into two concepts:
• Schema
• Data
The Schema is the structure of the data, whereas the Data are the "facts". A schema can be
complex to understand to begin with, but really it states the rules that the Data must obey.
Imagine a case where we want to store facts about employees in a company. Such facts could
include their name, address, date of birth, and salary. In a database all the information on all
employees would be held in a single storage "container", called a table. This table is a tabular
object like a spreadsheet page, with different employees as the rows, and the facts (e.g. their
names) as columns. Let's call this table EMP, and it could look something like Table 1.1 (the
rows shown are purely illustrative):
Table 1.1: The EMP table
NAME ADDRESS DOB SALARY
Jim Otti Meru 11 Jan 1980 45000
Bob Kinuthia Nairobi 23 Mar 1981 38000
From this information the schema would define that EMP has four components:
"NAME", "ADDRESS", "DOB", "SALARY". As designers we can call the columns what we like,
but making them meaningful helps. In addition to the names, we want to try and make sure that
people don't accidentally store a name in the DOB column, or some other silly error. Protecting
the database against rubbish data is one of the most important database design steps, and will be
covered in later chapters. From what we know about the facts, we can say things like "DOB must
hold a date" and "SALARY must hold a number".
Such rules, also referred to as constraints, can be enforced by a database. During the design
phase of a database schema these and more complex rules are identified and, where possible,
implemented. The more rules there are, the harder it is to enter poor-quality data.
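Below is a minimal sketch of how such rules could be declared in SQL; the column sizes and the
non-negative salary rule are illustrative assumptions rather than part of the original example.
CREATE TABLE EMP (
    NAME    VARCHAR(40) NOT NULL,              -- a name must always be present
    ADDRESS VARCHAR(80),
    DOB     DATE,                              -- only valid dates can be stored here
    SALARY  DECIMAL(9,2) CHECK (SALARY >= 0)   -- salaries cannot be negative
);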
A Database Management System (DBMS) is a collection of programs that enables users to
create and maintain a database. It can also be defined as a general-purpose software system that
facilitates the process of defining, constructing, and manipulating databases for various
applications.
Defining means specifying the data types, structures, and constraints, as described in the
example above. Constructing is the process of storing the data itself in some medium, controlled
by the DBMS. Manipulating involves operations such as querying to retrieve information,
updating, and generating reports.
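As a brief sketch of these activities against the illustrative EMP table above (defining was the
CREATE TABLE statement itself; the values here are made up):
-- Constructing: storing the data itself, under the control of the DBMS
INSERT INTO EMP VALUES ('Jim Otti', 'Meru', DATE '1980-01-11', 45000.00);
-- Manipulating: querying and updating the stored data
SELECT NAME, SALARY FROM EMP WHERE SALARY > 40000;
UPDATE EMP SET SALARY = SALARY * 1.05 WHERE NAME = 'Jim Otti';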
An Example
Let us consider another example: a UNIVERSITY database for maintaining information
concerning students, courses, and grades in a university environment. Figure 1.1 shows the
database structure and a few relationships in the database.
Figure 1.1: The University Database, showing entities (students, faculty, courses, offerings,
enrollments) and relationships (faculty teach offerings, students enroll in offerings, offerings are
made of courses), together with activities such as registration, assignment, grade recording, and
course scheduling.
Each STUDENT record may contain the information to represent the student's name,
registration number, class, courses, and major. Each COURSE record includes data to represent
the Course Name, Course Number, Credit Hours, department, etc.
Notice that the records in the various files may be related.
1.4.2 Insulation between Programs and Data, and Data abstraction
In traditional file processing, the structure of data files is embedded in the access programs, so
any changes to the structure of a file may require changing all programs that access this file. By
contrast, DBMS access programs do not require such changes in most cases. The structure of
the data files is stored in the DBMS catalog separately from the access programs. This property is
called program-data independence.
A DBMS provides users with a conceptual representation of data that does not include many of
the details of how the data is stored or how the operations are implemented. This is referred to as
data abstraction.
1.4.3 Support of multiple Views of Data
A database typically has many users, each of whom may require a different perspective or view
of the database. A view may be a subset of the database or it may contain virtual data that is
derived from the database file but is not explicitly stored.
1.4.4 Sharing of Data and Multi-user Transaction Processing
A multi-user DBMS must allow multiple users to access the database at the same time. The
DBMS must include concurrency control software to ensure that several users trying to update
the same data do so in a controlled manner so that the result of the updates is correct.
1.5 User Types
When considering the users of a database system, there are three broad classes to consider:
database administrators, database designers, and end users.
For a small personal database, one person typically defines, constructs, and manipulates the
database. However, for large databases, many people are involved. We next describe the three
broad categories mentioned above.
1.5.1 Database Administrators
The DBA is responsible for authorizing access to the database, for coordinating and monitoring
its use, and for acquiring software and hardware resources as needed. The DBA is accountable
for problems such as breaches of security or poor system response time. The duties of the DBA
include:
• Deciding the information content of the database, i.e. identifying the entities of interest to
the enterprise and the information to be recorded about those entities.
• Deciding the storage structure and access strategy, i.e. how the data is to be represented
by writing the storage structure definition.
• Liaising with users, i.e. to ensure that the data they require is available and to write the
necessary external schemas and conceptual/external mapping.
• Defining authorisation checks and validation procedures. Authorisation checks and
validation procedures are extensions to the conceptual schema and can be specified using
the DDL
• Defining a strategy for backup and recovery, for example periodic dumping of the
database to a backup tape and procedures for reloading the database from that backup. A
log file, where each log record contains the values of database items before and after a
change, can also be used for recovery purposes.
The advantages of the database approach include the following.
• Independence of data and program - This is a prime advantage of a database. Both the
database and the user program can be altered independently of each other, saving the
time and money which would otherwise be required to retain consistency.
• Data shareability and non-redundancy of data - The ideal situation is to enable
applications to share an integrated database containing all the data needed by the
applications and thus eliminate as much as possible the need to store data redundantly.
• Integrity - With many different users sharing various portions of the database, it is
impossible for each user to be responsible for the consistency of the values in the
database and for maintaining the relationships of their data items to all other data items,
some of which may be unknown or even prohibited for the user to access.
• Centralised control - With central control of the database, the DBA can ensure that
standards are followed in the representation of data.
• Security - Having control over the database the DBA can ensure that access to the
database is through proper channels and can define the access rights of any user to any
data items or defined subset of the database. The security system must prevent corruption
of the existing data either accidentally or maliciously.
• Performance and Efficiency - In view of the size of databases and of demanding database
accessing requirements, good performance and efficiency are major requirements.
Knowing the overall requirements of the organisation, as opposed to the requirements of
any individual user, the DBA can structure the database system to provide an overall
service that is `best for the enterprise'.
• Data independence is a prime advantage of a database. Both the database and the user
program can be altered independently of each other.
• In a conventional system applications are data dependent. This means that the way in
which the data is organised in secondary storage and the way in which it is accessed are
both dictated by the requirements of the application, and, moreover, that knowledge of
the data organisation and access technique is built into the application logic.
• For example, if a file is stored in indexed sequential form then an application must know
o that the index exists
o the file sequence (as defined by the index)
The internal structure of the application will be built around this knowledge. If, for example, the
file was to be replaced by a hash-addressed file, major modifications would have to be made to
the application.
Redundancy is
• Direct if a value is a copy of another
• Indirect if the value can be derived from other values:
o Simplifies retrieval but complicates update
o Conversely, integration makes retrieval slow and updates easier
• Data redundancy can lead to inconsistency in the database unless controlled. The system
should be aware of any data duplication - the system is responsible for ensuring updates
are carried out correctly. A DB with uncontrolled redundancy can be in an inconsistent
state - it can supply incorrect or conflicting information. A given fact represented by a
single entry cannot result in inconsistency - few systems are capable of propagating
updates i.e. most systems do not support controlled redundancy.
• Inconsistencies between two entries representing the same `fact' give an example of lack
of integrity (caused by redundancy in the database).
• Integrity constraints can be viewed as a set of assertions to be obeyed when updating a
DB to preserve an error-free state.
• Even if redundancy is eliminated, the DB may still contain incorrect data.
• Integrity checks, which are important, are checks on data items and record types.
• Type checks
o e.g. ensuring a numeric field is numeric and not a character - this check should be
performed automatically by the DBMS.
• Redundancy checks
o direct or indirect- this check is not automatic in most cases.
• Range checks
o e.g. to ensure a data item value falls within a specified range of values, such as
checking that an age satisfies (age > 0 AND age < 110).
• Comparison checks
o in this check a function of one set of data item values is compared against a
function of another set of data item values. For example, the maximum salary for
a given set of employees must be less than the minimum salary for the set of
employees on a higher salary scale. (A sketch of both kinds of check follows this
list.)
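Below is a minimal sketch of how a range check and a comparison check might be expressed in
SQL; the EMP table's AGE and SCALE columns are illustrative assumptions:
-- Range check, declared once and then enforced by the DBMS on every update
ALTER TABLE EMP ADD CONSTRAINT emp_age_range CHECK (AGE > 0 AND AGE < 110);
-- Comparison check, written as a query that returns rows only when the rule
-- "max salary on a scale < min salary on a higher scale" is violated
SELECT e1.SCALE, e2.SCALE
FROM EMP e1, EMP e2
WHERE e1.SCALE < e2.SCALE
GROUP BY e1.SCALE, e2.SCALE
HAVING MAX(e1.SALARY) >= MIN(e2.SALARY);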
A record type may have constraints on the total number of occurrences, or on the insertions and
deletions of records. For example, in a patient database there may be a limit on the number of
x-ray results for each patient, or the details of a patient's visits to hospital may have to be kept
for a minimum of 5 years before they can be deleted.
• Centralized control of the database helps maintain integrity, and permits the DBA to
define validation procedures to be carried out whenever any update operation is
attempted (update covers modification, creation and deletion).
• Integrity is important in a database system - an application run without validation
procedures can produce erroneous data which can then affect other applications using that
data.
1.6.5 Providing Persistent Storage for Program Objects and Data Structures
This is one of the main reasons for the emergence of the object-oriented database.
Programming languages typically have complex data structures such as the class definition in
C++. The values of program variables are discarded once a program terminates, unless the
programmer explicitly stores them in permanent files, which often involves converting these
complex structures into a format suitable for file storage. When the need arises to read this data
once more, the programmer must convert from the file format to the program variable structure.
Object oriented database systems are compatible with programming languages such as C++ and
JAVA, and the DBMS software automatically performs any necessary conversion. Thus object
oriented database systems typically offer data structure compatibility with one or more object-
oriented programming languages.
1.6.6 Permitting Inferencing and Actions Using Rules
Some database systems provide capabilities for defining deductive rules for inferencing new
information from the stored database facts. Such systems are called deductive database
systems.
1.6.7 Providing Multiple User Interfaces
Because many types of users with varying levels of technical knowledge use a database, a
DBMS should provide a variety of user interfaces. These include query languages for casual
users, programming language interface for application programmers, forms and command codes
for parametric users, and menu driven interfaces and natural language interfaces for stand-alone
users.
1.6.8 Representing Complex Relationships Among Data
A database may include numerous varieties of data that are interrelated in many ways.
1.6.9 Providing Backup And Recovery
A DBMS must provide facilities for recovering from hardware or software failures. For example,
if the computer system fails in the middle of a complex update program, the recovery sub-system
is responsible for making sure that the database is restored to the state it was in before the
program started executing.
Additional problems may arise if the database designers and DBA do not properly design the
database or if the database application is not implemented properly. Hence, it may be more
desirable to use regular files under the following circumstances:
• The database and application are simple, well defined, and not expected to change.
• There are stringent real-time requirements for some programs that may not be met
because of DBMS overhead.
• Multiple-user access to data is not required
The following summarises the limitations of a DBMS:
• Expensive
• Complex - administration is a full-time job
• Abstraction is not free
• Overhead of query processing
• Process optimisation
• Features may not be needed:
o Security - single user
o Concurrency - single user
o Integrity - single application
o Recovery - not mission critical
• Nonprocedural access
1.8 Database Architecture
1.8.1 Three-schema Architecture
DBMSs do not all conform to the same architecture. The three-level architecture forms the basis
of modern database architectures. The architecture for DBMSs is divided into three general
levels:
• External
• Conceptual
• Internal
Figure 1.2: Three level architecture
• The external level: concerned with the way individual users see the data. It expresses the
properties of program/data independence and multiple `views' of the database.
• The conceptual level: can be regarded as a community user view, a formal description of
the data of interest to the organisation, independent of any storage considerations.
• The internal level: concerned with the way in which the data is actually stored.
The application programmer may use a high-level language (e.g. C++) while the casual user will
probably use a query language. Regardless of the language used, it will include a data
sub-language (DSL): that subset of the language which is concerned with storage and retrieval of
information in the database, and which may or may not be apparent to the user.
A DSL is a combination of two languages:
• a data definition language (DDL) - provides for the definition or description of database
objects
• a data manipulation language (DML) - supports the manipulation or processing of
database objects.
Each user sees the data in terms of an external view, defined by an external schema, consisting
basically of descriptions of each of the various types of external record in that external view, and
also a definition of the mapping between the external schema and the underlying conceptual
schema.
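As a hedged sketch, an external view can be realised in SQL as a view over the illustrative EMP
table from Chapter 1; the view name is an assumption:
-- An external schema for users who may see names and addresses but not salaries
CREATE VIEW EMP_PUBLIC AS
SELECT NAME, ADDRESS FROM EMP;
-- Such users then query the view exactly as if it were a table
SELECT NAME FROM EMP_PUBLIC;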
Figure 1.3: How the three level architecture works
The internal level is, however, at one remove from the physical level, since it does not deal in
terms of physical records or blocks, nor with any device-specific constraints such as cylinder or
track sizes. Details of the mapping to physical storage are highly implementation-specific and
are not expressed in the three-level architecture.
1.8.5 Mappings
• The conceptual/internal mapping:
o defines conceptual and internal view correspondence
o specifies mapping from conceptual records to their stored counterparts
• An external/conceptual mapping:
o defines a particular external and conceptual view correspondence
• A change to the storage structure definition means that the conceptual/internal mapping
must be changed accordingly, so that the conceptual schema may remain invariant,
achieving physical data independence.
• A change to the conceptual definition means that the conceptual/external mapping must
be changed accordingly, so that the external schema may remain invariant, achieving
logical data independence.
2 Database Analysis
This unit it concerned with the process of taking a database specification from a customer and
implementing the underlying database structure necessary to support that specification.
2.1 Introduction
Data analysis is concerned with the NATURE and USE of data. It involves the identification of
the data elements which are needed to support the data processing system of the organization, the
placing of these elements into logical groups and the definition of the relationships between the
resulting groups.
Other approaches, e.g. DFDs and flowcharts, have been concerned with the flow of data
(dataflow methodologies). Data analysis is one of several data-structure-based methodologies.
Systems analysts often, in practice, go directly from fact finding to implementation dependent
data analysis. Their assumptions about the usage of properties of and relationships between data
elements are embodied directly in record and file designs and computer procedure specifications.
The introduction of Database Management Systems (DBMS) has encouraged a higher level of
analysis, where the data elements are defined by a logical model or `schema' (conceptual
schema). When discussing the schema in the context of a DBMS, the effects of alternative
designs on the efficiency or ease of implementation are considered, i.e. the analysis is still
somewhat implementation dependent.
It is fair to ask why data analysis should be done if it is possible, in practice, to go straight to a
computerised system design. Data analysis is time consuming; it throws up a lot of questions.
Implementation may be slowed down while the answers are sought.
From another viewpoint, data analysis provides useful insights into general design principles
which will benefit the trainee analyst even if he finally settles for a `quick and dirty' solution.
The development of techniques of data analysis has helped our understanding of the structure and
meaning of data in organisations. Data analysis techniques can be used as the first step of
extrapolating the complexities of the real world into a model that can be held on a computer and
be accessed by many users. The data can be gathered by conventional methods such as
interviewing people in the organisation and studying documents. The facts can be represented as
objects of interest. There are a number of documentation tools available for data analysis, such as
entity-relationship diagrams. These are useful aids to communication, help to ensure that the
work is carried out in a thorough manner, and ease the mapping processes that follow data
analysis. Some of the documents can be used as source documents for the data dictionary.
• Database study - here the designer creates a written specification in words for the
database system to be built. This involves:
o Analyzing the company situation - is it an expanding company, dynamic in its
requirements, mature in nature, solid background in employee training for new
internal products, etc.
o Define problems and constraints - what is the situation currently? How does the
company deal with the task which the new database is to perform? Any issues
around the current method? What are the limits of the new system?
o Define objectives - what is the new database system going to have to do, and in
what way must it be done. What information does the company want to store
specifically, and what does it want to calculate. How will the data evolve?
o Define scope and boundaries - what is stored on this new database system, and
what is stored elsewhere. Will it interface to another database?
• Database Design - conceptual, logical, and physical design steps in taking specifications
to physical implementable designs.
• Implementation and loading - it is quite possible that the database is to run on a
machine which does not yet have a database management system installed. If this is the
case, one must be installed on that machine. Once a DBMS has been installed, the
database itself must be created within the DBMS. Finally, not all databases start
completely empty, and thus the database must be loaded with the initial data set (such as
the current inventory, current staff names, current customer details, etc).
• Testing and evaluation - the database, once implemented, must be tested against the
specification supplied by the client. It is also useful to test the database with the client
using mock data, as clients do not always have a full understanding of what they think
they have specified and how it differs from what they have actually asked for! In
addition, this step in the life cycle offers the designer the chance to fine-tune the
system for best performance. Finally, it is a good idea to evaluate the database in situ,
along with any linked applications.
• Operation - this step is where the system is actually in real usage by the company.
• Maintenance and evolution - designers rarely get everything perfect first time, and it
may be the case that the company requests changes to fix problems with the system or to
recommend enhancements or new requirements.
o Commonly development takes place without change to the database structure. In
elderly systems the DB structure becomes fossilised.
An entity relationship (ER) model:
• is a design tool
• is a graphical representation of the database system
• provides a high-level conceptual data model
• supports the user's perception of the data
• is DBMS- and hardware-independent
• has many variants
• is composed of entities, attributes, and relationships
• defers DBMS implementation decisions.
2.4.3 Entities
• An entity is any object in the system that we want to model and store information about.
Entities represent real-world objects (person, employee, student) and concepts
(department, company, course).
• Individual objects are called entities.
• Groups of the same type of objects are called entity types or entity sets
• Entities are represented by rectangles (either with round or square corners)
• There are two types of entities: weak and strong entity types.
2.4.4 Attribute
• All the data relating to an entity is held in its attributes.
• An attribute is a property of an entity.
• Each attribute can have any value from its domain.
• An entity type may have any number of attributes. Each entity within the type:
o has the same set of attributes as every other entity of that type, but
o can have attribute values different from those in any other entity.
• Attributes can be :
o simple or composite
o single-valued or multi-valued
• Attributes can be shown on ER models
• They appear inside ovals and are attached to their entity.
• Note that entity types can have a large number of attributes... If all are shown then the
diagrams would be confusing. Only show an attribute if it adds information to the ER
diagram, or clarifies a point.
Attributes may be stored or derived, e.g. age can be derived from date of birth, and a bank
balance can be derived from deposits and withdrawals.
2.4.5 Keys
• A key is a data item that allows us to uniquely identify individual occurrences of an
entity type.
• A candidate key is an attribute or set of attributes that uniquely identifies individual
occurrences of an entity type.
• An entity type may have one or more possible candidate keys; the one which is selected
is known as the primary key.
• A composite key is a candidate key that consists of two or more attributes.
• The name of each primary key attribute is underlined.
• "Weak" entities must be related to at least one strong entity (an "identifying
relationship", a parent/child existence dependency).
• The unique key of a weak entity is the combination of its own key and the "identifying
relationship".
• Nulls: not applicable, or an unknown value which is known to exist.
A DDL sketch of these kinds of key follows.
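A minimal sketch of candidate, primary, and composite keys in SQL DDL; the table and column
names are illustrative assumptions:
CREATE TABLE student (
    student_no INTEGER PRIMARY KEY,   -- the candidate key chosen as primary key
    id_number  CHAR(10) UNIQUE        -- another candidate key, not chosen
);
CREATE TABLE enrolment (
    student_no  INTEGER,
    course_code CHAR(8),
    grade       CHAR(2),
    PRIMARY KEY (student_no, course_code)  -- a composite key
);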
2.4.6 Relationships
• A relationship type is a meaningful association between entity types
• A relationship is an association of entities where the association includes one entity from
each participating entity type.
• Relationship types are represented on the ER diagram by a series of lines.
• In the original Chen notation, the relationship is placed inside a diamond, e.g. managers
manage employees:
• For this module, we will use an alternative notation, where the relationship is a label on
the line. The meaning is identical
2.4.7 Degree of a Relationship
• The number of participating entities in a relationship is known as the degree of the
relationship.
• If there are two entity types involved it is a binary relationship type, e.g. student
enrolled_in course, instructor teaches course, manager manages employee, etc.
• If there are three entity types involved it is a ternary relationship type, e.g. student
enrolled_in course offered_by school, or instructor teaches course at school. Another
example of a ternary relationship is given in the diagram below.
Note: don't create unnecessary ternary relationships that can be represented as binary ones.
• A recursive relationship is one where the same entity participates more than once in
different roles.
• In the example above we are saying that employees are managed by employees.
• If we wanted more information about who manages whom, we could introduce a second
entity type called manager.
Questions
• What are the potential problems with recursive relationships?
• Should relationship attributes be a separate entity?
• Which entity ultimately stores that attribute?
• It is also possible to have entities associated through two or more distinct relationships.
This is an example of multiple relationships.
• In the representation we use it is not possible to have attributes as part of a relationship.
To support this, other entity types need to be developed.
• This can result in the loss of some information - It is no longer clear which sales assistant
sold a customer a particular product.
• Try replacing the ternary relationship with an entity type and a set of binary relationships.
Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as
a noun.
• So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product. This process also works for higher order relationships.
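A hedged DDL sketch of the result for the sales example; every table and column name here is
an assumption made for illustration, and the three referenced tables are presumed to exist:
CREATE TABLE sale (
    sale_id      INTEGER PRIMARY KEY,
    assistant_id INTEGER REFERENCES sales_assistant(assistant_id),
    customer_id  INTEGER REFERENCES customer(customer_id),
    product_id   INTEGER REFERENCES product(product_id)
);
-- Each row records which assistant sold which product to which
-- customer, so the information lost earlier is preserved.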
2.5 Cardinality
• Relationships are rarely one-to-one
• For example, a manager usually manages more than one employee
• This is described by the cardinality of the relationship, for which there are four possible
categories.
o One to one (1:1) relationship
o One to many (1:m) relationship
o Many to one (m:1) relationship
o Many to many (m:n) relationship
• On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's
foot" end represents many.
• A one to one relationship - a man can only marry one woman, and a woman can only
marry one man, so it is a one to one (1:1) relationship
• A one to many relationship - one manager manages many employees, but each employee
only has one manager, so it is a one to many (1:m) relationship
• A many to one relationship - many students study one course. They do not study more
than one course, so it is a many to one (m:1) relationship
• A many to many relationship - One lecturer teaches many students and a student is taught
by many lecturers, so it is a many to many (m:n) relationship
2.6 Optionality
A relationship can be optional or mandatory.
• If the relationship is mandatory an entity at one end of the relationship must be related to
an entity at the other end.
• The optionality can be different at each end of the relationship. For example, a student
must be on a course. This is mandatory: the relationship `student studies course' is
mandatory.
• But a course can exist before any students have enrolled. Thus the relationship `course
is_studied_by student' is optional.
• To show optionality, put a circle or `0' at the `optional end' of the relationship.
• As the optional relationship is `course is_studied_by student', and the optional part of this
is the student, then the `O' goes at the student end of the relationship connection.
• It is important to know the optionality because you must ensure that whenever you create
a new entity it has the required mandatory links
2.7 Participation
Participation is either partial or total (an "existence dependency").
Consider student, faculty, course, and department: what are the relationships between them?
• Use the diagram to show all possible relationship scenarios.
• Go back to the requirements specification and check to see if they are allowed.
• If not, then put a cross through the forbidden relationships
• This allows you to show the cardinality and optionality of the relationship
• Consider entities `customer' (customer details), `address' (the address of a customer) and
`distance' (distance from the company to the customer address).
2.10 Splitting n:m Relationships
A many to many relationship in an ER model is not necessarily incorrect. It can be replaced
using an intermediate entity. This should only be done where the relationship carries information
of its own, such as attributes that belong to the association itself.
Consider the case of a car hire company. Customers hire cars: one customer hires many cars,
and a car is hired by many customers.
The many to many relationship can be broken down to reveal a `hire' entity, which contains an
attribute `date of hire'.
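A sketch of the resulting tables in SQL DDL; the names and column types are illustrative
assumptions:
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
CREATE TABLE car      (regno CHAR(8) PRIMARY KEY);
-- The intermediate `hire' entity carries the relationship's own attribute
CREATE TABLE hire (
    customer_id  INTEGER REFERENCES customer(customer_id),
    regno        CHAR(8) REFERENCES car(regno),
    date_of_hire DATE,
    PRIMARY KEY (customer_id, regno, date_of_hire)
);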
Before beginning to draw the ER model, read the requirements specification carefully. Document
any assumptions you need to make.
• Identify entities - list all potential entity types. These are the objects of interest in the
system. It is better to put too many entities in at this stage and then discard them later if
necessary.
• Remove duplicate entities - check whether they really are separate entity types or just
two names for the same thing.
• List the attributes of each entity (all properties to describe the entity which are relevant
to the application).
• List the attributes of each entity (all properties to describe the entity which are relevant to
the application).
• Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
• Define the relationships
o Examine each entity type to see its relationship to the others.
• Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
• Remove redundant relationships
Figure: ER Notation
ER modelling is an iterative process, so draw several versions, refining each one until you are
happy with it. Note that there is no one right answer to the problem, but some solutions are better
than others!
Naming Conventions
• Nouns for entity types and attributes.
• Verbs for relationships.
• Links between entities and relationships can be named with the "role" being played by
the entity in that relationship.
o A student entity plays the role of teaching assistant for a course.
o A faculty entity plays the role of advisor for a student.
Pathways
• Identify and give names to the "things" being modeled, i.e., identify the entities.
• List the attributes for the entities.
• Based on the semantics, list the candidate keys and select a primary key (except for weak
entities).
• Discover and give active descriptions to relationships between entities.
• Find "identifying relationships" for weak entities by looking for "owner" entity, situations
where one entity type is information about another entity type.
• Finish off some details: look for total participation, cardinality, role names, etc.
2.12 ER Examples
A Country Bus Company owns a number of buses. Each bus is allocated to a particular route,
although some routes may have several buses. Each route passes through a number of towns.
One or more drivers are allocated to each stage of a route, which corresponds to a journey
through some or all of the towns on a route. Some of the towns have a garage where buses are
kept, and each bus is identified by its registration number and can carry different numbers of
passengers, since the vehicles vary in size and can be single or double-decked. Each route is
identified by a route number and information is available on the average number of passengers
carried per day for each route. Drivers have an employee number, name, address, and
sometimes a telephone number.
Entities
• Bus - the company owns buses and will hold information about them.
• Route - buses travel on routes, which will need to be described.
• Town - buses pass through towns, which we need to know about.
• Driver - the company employs drivers; personnel will hold their data.
• Stage - routes are made up of stages.
• Garage - a garage houses buses, and we need to know where they are.
Relationships
• A route comprises one or more stages.
• route-stage (1:m) comprises
• One or more drivers are allocated to each stage.
• driver-stage (m:1) is allocated
• A stage passes through some or all of the towns on a route.
• stage-town (m:n) passes-through
• A route passes through some or all of the towns
• route-town (m:n) passes-through
• Some of the towns have a garage
• garage-town (1:1) is situated
• A garage keeps buses and each bus has one `home' garage
• garage-bus (m:1) is garaged
Attributes
• Bus (reg-no, make, size, deck, no-pass)
• Route (route-no,avg-pass)
• Driver (emp-no,name,address,tel-no)
• Town (name)
• Stage (stage-no)
• Garage (name,address)
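As a hedged sketch, the garage-bus (m:1) `is garaged' relationship could be implemented with a
foreign key on the bus table; the column types are assumptions:
CREATE TABLE garage (
    name    VARCHAR(30) PRIMARY KEY,
    address VARCHAR(60)
);
CREATE TABLE bus (
    reg_no      CHAR(8) PRIMARY KEY,
    make        VARCHAR(20),
    size        VARCHAR(10),
    deck        VARCHAR(10),
    no_pass     INTEGER,
    garage_name VARCHAR(30) REFERENCES garage(name)  -- each bus has one home garage
);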
There are several problems that may arise when designing a conceptual data model. These are
known as connection traps.
1. fan traps
2. chasm traps
A fan trap occurs where a model represents a relationship between entity types, but the pathway
between certain entity occurrences is ambiguous; it typically arises where two 1:m relationships
fan out from the same entity. For example, a single site contains many departments and employs
many staff. However, which staff work in a particular department?
The fan trap is resolved by restructuring the original ER model to represent the correct
association.
A chasm trap occurs when a model suggests the existence of a relationship between entity types,
but the pathway does not exist between certain entity occurrences.
It occurs where there is a relationship with partial participation, which forms part of the pathway
between entities that are related.
• A single branch is allocated many staff who oversee the management of properties for
rent. Not all staff oversee property and not all property is managed by a member of staff.
• What properties are available at a branch?
• The partial participation of Staff and Property in the oversees relation means that some
properties cannot be associated with a branch office through a member of staff.
• We need to add the missing relationship which is called `has' between the Branch and the
Property entities.
• You need to therefore be careful when you remove relationships which you consider to
be redundant.
2.15 Enhanced ER Models (EER)
The basic concepts of ER modelling are not powerful enough for some complex applications. We
require some additional semantic modelling concepts:
• Specialisation
• Generalisation
• Categorisation
• Aggregation
• Superclass - an entity type that includes distinct subclasses that require to be represented
in a data model.
• Subclass - an entity type that has a distinct role and is also a member of a superclass.
Subclasses need not be mutually exclusive; a member of staff may be a manager and a sales
person.
The purpose of introducing superclasses and subclasses is to avoid describing types of staff with
possibly different attributes within a single entity. This could waste space and you might want to
make some attributes mandatory for some types of staff but other staff would not need these
attributes at all.
2.15.2 Participation Constraints
A participation constraint specifies whether the existence of an entity depends on its being
related to another entity via the relationship type.
E.g. a requirement that each department should have a manager.
The participation of the entity set Departments in the relationship set Manages is then said to be
total, meaning that every department entity must be related to some employee entity via
Manages.
A participation that is not total is said to be partial: the participation of Employees in Manages
is partial, since only some employee entities manage a department.
2.15.3 Weak Entities
These are entities that do not have key attributes of their own.
2.15.4 Aggregation
One limitation of the ER model is that it cannot express relationships among relationships.
Aggregation is an abstraction through which relationships are treated as higher-level entities.
It allows us to indicate that a relationship set (identified through a dashed box) participates in
another relationship set.
We use aggregation when we want to express a relationship among relationships.
2.15.5 Class Hierarchies
Sometimes it is natural to classify the entities in an entity set into subclasses.
E.g. consider the Hourly_Emps and Contract_Emps entity sets.
We say that the attributes of the entity set Employees are inherited by the entities of
Hourly_Emps, and that Hourly_Emps ISA Employees.
Covering constraints determine whether the entities in the subclasses collectively include all
entities in the superclass.
2.15.6 Specialisation
This is the process of maximising the differences between members of an entity by identifying
their distinguishing characteristics.
• Staff(staff_no,name,address,dob)
• Manager(bonus)
• Secretary(wp_skills)
• Sales_personnel(sales_area, car_allowance)
• Here we have shown that the manages relationship is only applicable to the Manager
subclass, whereas the works_for relationship is applicable to all staff.
• It is possible to have subclasses of subclasses.
2.15.7 Generalisation
This is the reverse process: minimising the differences between entities by identifying their
common features. For example, generalising:
car(regno, colour, make, model, numSeats)
motorbike(regno, colour, make, model, hasWindshield)
and forming:
vehicle(regno, colour, make, model, numSeats, hasWindshield)
In this case vehicle has numSeats, which would be NULL if the vehicle was a motorbike, and
hasWindshield, which would be NULL if it was a car.
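A minimal DDL sketch of the generalised table; the column types are illustrative assumptions:
CREATE TABLE vehicle (
    regno         CHAR(8) PRIMARY KEY,
    colour        VARCHAR(10),
    make          VARCHAR(20),
    model         VARCHAR(20),
    numSeats      INTEGER,   -- NULL when the vehicle is a motorbike
    hasWindshield BOOLEAN    -- NULL when the vehicle is a car
);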
3 Structured Query Language
In the other chapters of this course consideration is given to producing a good design for a
database structure or schema. In this chapter the focus is on applying this schema to a database
management system, and then using that DBMS to allow storage and retrieval of data.
To communicate with the database system itself we need a language. We use SQL for
illustration; the same statements work, with minor variations, in most DBMSs (e.g. Oracle,
Access). SQL is an international standard language for manipulating relational databases. It is
based on an IBM product. SQL is short for Structured Query Language.
SQL can create schemas, delete them, and change them. It can also put data into schemas and
remove data. It is a data handling language, but it is not a programming language.
SQL is a DSL (Data Sub Language), which is really a combination of two languages. These are
the Data Definition Language (DDL) and the Data Manipulation Language (DML). Schema
changes are part of the DDL, while data changes are part of the DML. We will consider both
parts of the DSL in this discussion of SQL.
There are several database model types:
• hierarchic
• network
• relational
Models other than the relational model used to be quite popular. Each model type is appropriate
to particular types of problem. The relational model is the most popular in use today, and the
other types are not discussed further.
3.2 Relational Databases
SQL is an ISO-standard language based on relational algebra, a mathematical formulation of
operations over relations (covered in a later chapter).
3.3 Relational Data Structure
A relational data structure is a collection of tables or relations.
• Each relation is a table with rows and columns; thus a relation is a collection of rows or
tuples.
• This tabular representation is simple and permits the use of queries on the data.
• A tuple (row) is a collection of attribute values.
• A domain is a pool of values from which the actual attribute values are taken.
The Accounts relation:
Account number Branch name Balance
A-101 Downtown 500
A-102 Westlands 400
A-201 Parklands 900
A-215 Meru 750
A-217 Kisii 700
A-222 Embu 350
• A row in a table represents a relationship among a set of values.
• Table is a collection of such relationships.
• We follow the terminology of relational models.
• Table headers are referred to as attributes.
• For each attribute there is a set of permitted values called the domain of that attribute.
• Because tables are relations, we use the mathematical terms relation and tuple in place of
table and row.
• For all relations, the domains of all attributes should be atomic, i.e. elements of the
domain are considered indivisible.
Consider another example: a table holding "driver" information and a table holding "car"
information. Each car is owned by a driver, and therefore there is a link between "car" and
"driver" to indicate which driver owns which car.
In the subsequent pages we will refer back to this driver and car arrangement. To make the
examples easier, let's create some example data.
Car
The CAR table has the following structure: CAR(REGNO, MAKE, COLOUR, PRICE, OWNER).
DRIVER
The DRIVER table has the following structure: DRIVER(NAME, DOB).
CAR to DRIVER is an N:1 relationship. This indicates that a CAR can have only 1 DRIVER, but
that a DRIVER can own more than 1 CAR simultaneously.
In the design section we can see that this requires a FOREIGN KEY at the CAR end of the
relationship. This foreign key allows us to implement the relationship in the database. We will
call this field OWNER.
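A hedged DDL sketch of the two tables and the OWNER foreign key; the column types are
assumptions:
CREATE TABLE DRIVER (
    NAME VARCHAR(30) PRIMARY KEY,
    DOB  DATE
);
CREATE TABLE CAR (
    REGNO  CHAR(8) PRIMARY KEY,
    MAKE   VARCHAR(20),
    COLOUR VARCHAR(10),
    PRICE  DECIMAL(8,2),
    OWNER  VARCHAR(30) REFERENCES DRIVER(NAME)  -- NULL while the car has no owner
);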
NAME DOB
Jim Otti 11 Jan 1980
Bob Kinuthia 23 Mar 1981
Bob Fondo 3 Dec 1986
CAR
REGNO MAKE COLOUR PRICE OWNER
F611 AAA FORD RED 12000 Jim Otti
J111 BBB SKODA BLUE 11000 Jim Otti
A155 BDE MERCEDES BLUE 22000 Bob Kinuthia
K555 GHT FIAT GREEN 6000 Bob Fondo
SC04 BFE SMART BLUE 13000
Columns hold values of a particular type, e.g. integer or decimal for numbers, or character for
text data; the range of values can be further constrained.
If a column of a row contains no data, we say it is NULL. For example, a car just off the
production line might not have an owner in the database until someone buys the car. A NULL
value may also indicate that the value is unavailable or inappropriate. This might be the case for
a car which is being destroyed or a car where two people are arguing in court that they are both
the owner.
• All rows of a table must be different in some way from all other rows.
• Sometimes a row is referred to as a Tuple.
• Cardinality is the number of ROWS in a table.
• Arity is the number of COLUMNS in a table.
• The schema describes the relation's name, the name of each field (or column or attribute),
and the domain of each field.
• An instance of a relation is a set of tuples (also called records) in which each tuple has the
same number of fields as the schema.
A table requires a key which uniquely identifies each row in the table. This is entity integrity.
The key could have one column, or it could use all the columns. It should not use more columns
than necessary. A key with more than one column is called a composite key.
A table may have several possible keys, the candidate keys, from which one is chosen as the
primary key.
If the rows of the data are not unique, it is necessary to generate an artificial primary key.
In our example, DRIVER has a primary key of NAME, and CAR has a primary key of REGNO.
This database will break if there are two drivers with the same name, but it gives you an idea of
what the primary key means.
• An integrity constraint (IC) is a condition that is specified on a database schema and
restricts the data that can be stored in an instance of the database.
• Legal instance: a database instance that satisfies all the ICs.
• A DBMS enforces ICs in that it permits only legal instances to be stored in a database.
ICs are specified when the schema is defined and checked whenever a relation is modified. The
main kinds of ICs are:
• Domain constraints
• Key constraints
• Foreign key constraints
• General constraints
Domain constraints specify a basic condition that we want each instance of a relation to satisfy:
each attribute must be atomic, and values that appear in a column must be drawn from the
domain associated with that column.
• Primary key constraints: a statement that a certain minimal subset of the fields of a
relation is a unique identifier for a tuple.
E.g. for a student relation there should be a constraint that no two students have the same
student ID.
• Candidate key (key): a set of fields that uniquely identifies a tuple according to a key
constraint.
• By convention, the attributes that form the primary key of a relation schema are underlined.
• A relation may have several candidate keys.
• E.g. {sid} and {login} in the STUDENT relation are both candidate keys.
• The keys must identify the tuples uniquely in all possible instances of the relation.
Foreign Key constraints
•Sometimes, information stored in a relation is linked to information stored in another relation
•If one relation is modified the other one should be checked for consistency.
General constraints: additional application-specific rules (for example, CHECK conditions) that the stored data must satisfy beyond domain, key, and foreign key constraints.
In the remainder of this section only simple SELECT statements are considered.
Simple SELECT
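The simplest SELECT names one column and one table; for example, using the CAR table from earlier:

SELECT regno
FROM car;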
This would produce all the rows from the specified table, but only for the particular column
mentioned. If you want more than one column shown, you can put in multiple columns
separating them with commas, like:
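SELECT colour, owner
FROM car;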
If you want to see all the columns of a particular table, you can type:
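SELECT *
FROM car;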
The regno-only query above returns:
REGNO
F611 AAA
J111 BBB
A155 BDE
K555 GHT
SC04 BFE
and the colour/owner query returns:
COLOUR OWNER
RED Jim Smith
BLUE Jim Smith
BLUE Bob Smith
GREEN Bob Jones
BLUE
In SQL, you can put extra space characters and return characters just about anywhere without
changing the meaning of the SQL. SQL is also case-insensitive (except for things in quotes). In
addition, SQL in theory should always end with a ';' character. You need to include the ';' if you
have two different SQL queries so that the system can tell when one SQL statement stops and
another one starts. If you forget the ';' the online interface will put one in for you. For these
reasons all of the following statements are identical and valid.
SELECT REGNO FROM CAR;
SELECT
regno
FROM car;
Comments
Sometimes you might want to write a comment somewhere as part of an SQL statement. A
comment in this case is a simple piece of text which is meaningful to yourself, but should be
ignored by the database. The characters '--', when they appear in a query, indicate the start of a
comment. Everything after that point is ignored until the end of that line. The following queries
are all equivalent.
SELECT regno
FROM car;
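For instance, the same query with a comment inside it:

SELECT regno -- we only want the registration numbers
FROM car;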
Warning: You cannot put a comment immediately after a ';'. Comments are only supported
within the text of an SQL statement. The following will cause SQL errors:
SELECT regno
FROM car; -- Error here as comment is after the query
SELECT filters
Displaying all the rows of a table can be handy, but if we have tables with millions of rows then
this type of query could take hours. Instead, we can add "filters" onto a SELECT statement to
only show specific rows of a table. These filters are written into an optional part of the SELECT
statement, known as a WHERE clause.
SELECT columns
FROM table
WHERE rule
The "rule" section of the WHERE clause is checked for every row that a select statement would
normally show. If the whole rule is TRUE, then that row is shown, whereas if the rule is FALSE,
then that row is not shown.
The rule itself can be quite complex. The simplest rule is a single equality test, such as
"COLOUR = 'RED'".
From the database we know that only F611 AAA is RED, and the rest of the cars are either
BLUE or GREEN. Thus a rule COLOUR = 'RED' is only true on the row with F611 AAA, and
false elsewhere. With everything in a query:
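SELECT regno
FROM car
WHERE colour = 'RED';

REGNO
F611 AAA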
An important point to note is that queries are case sensitive between the quotes. Thus 'RED' will
work, but 'red' will produce nothing. The case used in the quotes must match perfectly the case
stored in the table. SQL is not forgiving and if you forget this you can be scratching your head for
hours trying to fix it.
Note also that "colour" does not have to appear on the SELECT line as a column name. It can if
you want to see the colour, but there is no requirement for it to be there. Therefore this will work
too:
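SELECT regno, colour
FROM car
WHERE colour = 'RED';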
REGNO COLOUR
F611 AAA RED
Comparisons
SQL supports a variety of comparison rules for use in a WHERE clause. These include =, !=, <>,
<, <=, >, and >=.
Note that when dealing with strings, like RED, you must say 'RED'. When dealing with numbers,
like 10000, you can say '10000' or 10000. The choice is yours.
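For example, to list the more expensive cars:

SELECT regno, price
FROM car
WHERE price > 10000;

Against the CAR table this returns every car except the FIAT.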
Dates
Date rules are some of the hardest rules to get right in writing SQL, yet there is nothing
particularly complex about them. The hard part is working out what it means to be GREATER
THAN a particular date.
Consider the DRIVER table:
NAME DOB
Jim Smith 11 Jan 1980
Bob Smith 23 Mar 1981
Bob Jones 3 Dec 1986
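To find, say, the drivers born after a particular date (the exact cut-off used here is an assumption; any date between Bob Smith's and Bob Jones's dates of birth gives the same result):

SELECT name, dob
FROM driver
WHERE dob > '1 Jan 1982';

This returns only: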
NAME DOB
Bob Jones 3 Dec 1986
As with other comparators, it is important to realise that a date gets bigger as you move into the future,
and smaller as you move into the past. Thus by saying 'DATE1 < DATE2' you are stating that
DATE1 occurs before DATE2 on a calendar. For example, to find all drivers who were born on
or after the 1st Jan 1981 you would do:
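SELECT name, dob
FROM driver
WHERE dob >= '1 Jan 1981';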
NAME DOB
Bob Smith 23 Mar 1981
Bob Jones 3 Dec 1986
The syntax for dates does change slightly on different database systems, but the syntax '1 Jan
2000' works in general on all systems. Oracle also allows dates like '1-Jan-2000' and '1-Jan-00'.
If you specify a year using only the last two digits, Oracle uses the current date to compute the
missing parts of the year, converting '00' to '2000'. Do not get confused by saying '87' for '1987'
and ending up with '2087'!
BETWEEN
Sometimes when you are dealing with dates you want to specify a range of dates to check. The
best way of doing this is using BETWEEN. For instance, to find all the drivers born between
1985 and 1989 you could use:
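SELECT name, dob
FROM driver
WHERE dob BETWEEN '1 Jan 1985' AND '31 Dec 1989';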
NAME DOB
Bob Jones 3 Dec 1986
Note that the dates have the day of the month and the month in them, and not just the year. In SQL, all
dates must have a day, a month, and a year. If you try to use just a year the query will fail.
BETWEEN works for other things, not just dates. For instance, to find cars worth between 5000
and 10000, you could execute:
SELECT regno
FROM car
where price between 5000 and 10000;
REGNO
K555 GHT
NULL
The NULL value indicates that something has no real value. For this reason the normal value
comparisons will always fail if you are dealing with a NULL. If you are looking for NULL, for
instance looking for cars without owners using OWNER of CAR, all of the following are wrong!
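SELECT regno FROM car WHERE owner = NULL;
SELECT regno FROM car WHERE owner != NULL;

Both comparisons evaluate to unknown for every row, so neither query returns anything.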
Instead SQL has a special comparison operator called IS which allows us to find NULL values.
There is also an opposite to IS, called IS NOT, which finds all the values which are not NULL.
So finding all the regnos of cars with current owners would be (note that if they have an owner,
then the owner has a value and thus is NOT NULL):
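SELECT regno
FROM car
WHERE owner IS NOT NULL;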
REGNO
F611 AAA
J111 BBB
A155 BDE
K555 GHT
LIKE
When dealing with strings, sometimes you do not want to match on exact strings like ='RED',
but instead on partial strings, substrings, or particular patterns. This could allow you, for
instance, to find all cars with a colour starting with 'B'. The LIKE operator provides this
functionality.
The LIKE operator is used in place of an '=' sign. In its basic form it is identical to '='. For
instance, both of the following statements are identical:
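SELECT regno FROM car WHERE colour = 'RED';
SELECT regno FROM car WHERE colour LIKE 'RED';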
The power of LIKE is that it supports two special characters, '%' and '_'. These are equivalent to
the '*' and '?' wildcard characters of DOS. Wherever there is a '_' character in the string, exactly one
character will match. Wherever there is an '%' character in the string, 0 or more characters will
match. Consider these rules:
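colour LIKE 'BLUE' matches only the exact string BLUE
colour LIKE 'B%' matches any colour beginning with B
colour LIKE '%E' matches any colour ending in E
colour LIKE '_LUE' matches any four-character colour ending in LUE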
Note however that LIKE is more powerful than a simple '=' operator, and thus takes longer to
run. If you are not using any wildcard characters in a LIKE operator then you should always
replace LIKE with '='.
4 DATABASE DESIGN
In this chapter, we shall consider:
• Schema Refinement
• Functional Dependencies
• Normal Forms
• Decompositions
• Normalization
• Other Types of Functional Dependencies
4.1 Schema Refinement
Conceptual database design gives a set of relational schemas and integrity constraints (ICs). This
initial design must be refined by taking the ICs into account more carefully.
These constraints include:
•Functional Dependencies (Most important)
•Multivalued Dependencies
•Join Dependencies
Problems that schema refinement is intended to address:
Redundancy
•One method of eliminating redundancy is decomposition.
•Decomposition also has some problems.
4.1.1 Problems Caused by Redundancy
• Redundant storage: some information is stored repeatedly.
• Update anomalies: if one copy of such repeated data is updated, an inconsistency is created.
• Insertion anomalies: it may not be possible to store some information unless some other
information is stored as well.
• Deletion anomalies: it may not be possible to delete some information without losing some
other information as well.
An Example: Hourly_Emps Relation
•Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)
We cannot insert a tuple for an employee unless we know the hourly wage for the employee’s
rating value (an insertion anomaly), because rating determines hourly_wages.
If we delete all tuples with a given rating value, we lose the association between that rating and its
hourly_wages (a deletion anomaly).
If a schema is in one of the normal forms, then we know that certain kinds of problems cannot arise.
Considering the normal form can help us decide whether or not to decompose a relation.
Properties of decomposition
• Lossless join: enables us to recover any instance of the decomposed relation from the
corresponding instances of the smaller relations.
• Dependency preservation: enables us to enforce a constraint on the original relation by
simply enforcing it on the smaller relations.
Problems of Decomposition
Queries over the original relation may require us to join the decomposed relations.
4.2 Functional Dependencies
A functional dependency (FD) is a kind of IC between two sets of attributes that generalizes the
concept of a key. For example, in Hourly_Emps above, rating -> hourly_wages is an FD.
4.3 Normalization
What is normalisation?
Normalisation is the process of taking data from a problem and reducing it to a set of relations,
while ensuring data integrity and eliminating data redundancy.
• Data integrity - all of the data in the database are consistent, and satisfy all integrity
constraints.
• Data redundancy – if data in the database can be found in two different locations (direct
redundancy) or if data can be calculated from other data items (indirect redundancy) then
the data is said to contain redundancy.
Data should only be stored once; storing data that can be calculated from other data
already held in the database should be avoided. During the process of normalisation redundancy must be removed,
but not at the expense of breaking data integrity rules.
If redundancy exists in the database then problems can arise when the database is in normal
operation:
• When data is inserted the data must be duplicated correctly in all places where there is
redundancy. For instance, if two tables exist in a database, and both tables contain the
employee name, then creating a new employee entry requires that both tables be updated
with the employee name.
• When data is modified in the database, if the data being changed has redundancy, then all
versions of the redundant data must be updated simultaneously. So in the employee
example a change to the employee name must happen in both tables simultaneously.
The removal of redundancy helps to prevent insertion, deletion, and update errors, since the data
is only available in one attribute of one table in the database.
The data in the database can be considered to be in one of a number of `normal forms'. Basically
the normal form of the data indicates how much redundancy is in that data. The normal forms
have a strict ordering: 1st normal form, then 2nd normal form, then 3rd normal form, then
Boyce-Codd normal form (BCNF), each stricter than the last.
There are other normal forms, such as 4th and 5th normal forms. They are rarely utilised in
system design and are not considered further here.
To be in a particular form requires that the data meets the criteria to also be in all normal forms
before that form. Thus to be in 2nd normal form the data must meet the criteria for both 2nd
normal form and 1st normal form. The higher the form the more redundancy has been eliminated.
4.4 Integrity Constraints
An integrity constraint is a rule that restricts the values that may be present in the database. The
relational data model includes constraints that are used to verify the validity of the data as well as
adding meaningful structure to it:
• Entity integrity:
The rows (or tuples) in a relation represent entities, and each one must be uniquely identified.
Hence we have the primary key that must have a unique non-null value for each row.
• Referential integrity:
This constraint involves the foreign keys. Foreign keys tie the relations together, so it is vitally
important that the links are correct. Every foreign key must either be null or its value must be the
actual value of a key in another relation.
4.5 Understanding Data
Sometimes the starting point for understanding data is given in the form of relations and
functional dependencies. This would be the case where the starting point in the process was a
detailed specification of the problem. We already know what relations are. Functional
dependencies are rules stating that a certain set of attributes (the determinant) determines a
second set of attributes.
The definition of a functional dependency looks like A->B. In this case B is a single attribute, but
the right hand side can contain as many attributes as required (for instance, X->J,K,L,M). In a
functional dependency, the determinant (the left hand side of the -> sign) determines the set of
attributes on the right hand side: each value of the determinant is associated with exactly one
value of the determined attributes. It can also be said that B is functionally dependent on A. A
particular value of A ALWAYS gives you a particular value for B, but not vice-versa.
Consider this example:
R(matric_no, firstname, surname, tutor_number, tutor_name)
tutor_number -> tutor_name
Here there is a relation R, and a functional dependency that indicates that, given a tutor_number,
the corresponding tutor_name can be determined.
There is actually a second functional dependency for this relation, which can be worked out from
the relation itself. As the relation has a primary key (matric_no), then given this attribute you can
determine all the other attributes in R. This is an implied functional dependency and is not
normally listed in the list of functional dependencies.
Extracting understanding
It is possible that the relations and the determinants have not yet been defined for a problem, and
therefore must be calculated from examples of the data. Consider the following Student table,
reconstructed here from the Student and Record data shown later (the rows elided there are omitted):
matric_no name date_of_birth subject grade
960100 Smith,J 14/11/1977 Databases C
Soft_Dev A
ISDE D
960105 White,A 10/05/1975 Soft_Dev B
ISDE B
... ... ... ... ...
960150 Black,D 21/08/1973 Workshop B
Flattened Tables
Note that the student table shown above explicitly identifies the repeating group. It is also
possible that the table presented will be what is called a flat table, where the repeating group is
not explicitly shown:
Student #2 - Flattened Table
matric_no name date_of_birth subject grade
960100 Smith,J 14/11/1977 Databases C
960100 Smith,J 14/11/1977 Soft_Dev A
960100 Smith,J 14/11/1977 ISDE D
960105 White,A 10/05/1975 Soft_Dev B
960105 White,A 10/05/1975 ISDE B
... ... ... ... ...
960150 Black,D 21/08/1973 Workshop B
The table still shows the same data as the previous example, but the format is different. We have
removed the repeating group (which is good) but we have introduced redundancy (which is bad).
Sometimes you will miss spotting the repeating group, so you may produce something like the
following relation for the Student data.
Student(matric_no, name, date_of_birth, subject, grade )
matric_no -> name, date_of_birth
name, date_of_birth -> matric_no
This data does not explicitly identify the repeating group, but as you will see the result of the
normalisation process on this relation produces exactly the same relations as the normalisation of
the version that explicitly does have a repeating group.
4.5.1 First Normal Form
• First normal form (1NF) deals with the `shape' of the record type
• A relation is in 1NF if, and only if, it contains no repeating attributes or groups of
attributes.
• Example:
• The Student table with the repeating group is not in 1NF
• It has repeating groups, and it is called an `unnormalised table'.
Relational databases require that each row only has a single value per attribute, and so a
repeating group in a row is not allowed.
To remove the repeating group, one of two things can be done: either flatten the table and extend
the primary key, or split the data into two relations. Both approaches are considered below.
Flatten table and Extend Primary Key
The Student table with the repeating group can be written as:
Student(matric_no, name, date_of_birth, (subject, grade))
If the repeating group was flattened, as in the Student #2 data table, it would look something
like:
Student(matric_no, name, date_of_birth, subject, grade)
Although this is an improvement, we still have a problem. matric_no can no longer be the
primary key - it does not have a unique value for each row. So we have to find a new primary
key - in this case it has to be a compound key, since no single attribute can uniquely identify a
row. The new primary key is a compound key (matric_no + subject).
We have now solved the repeating groups problem, but we have created other complications.
Every repetition of the matric_no, name, and date_of_birth is redundant and liable to produce
errors.
With the relation in its flattened form, strange anomalies appear in the system. Redundant data is
the main cause of insertion, deletion, and updating anomalies.
Insertion anomaly:
With the primary key including subject, we cannot enter a new student until they have at least
one subject to study. We are not allowed NULLs in the primary key so we must have an entry in
both matric_no and subject before we can create a new record.
• This is known as the insertion anomaly. It is difficult to insert new records into the
database.
• On a practical level, it also means that it is difficult to keep the data up to date.
Update anomaly
If the name of a student were changed, for example Smith,J. to Green,J., this would
require not one change but many: one for every subject that Smith,J. studied.
Deletion anomaly
If all of the records for the `Databases' subject were deleted from the table, we would
inadvertently lose all of the information on the student with matric_no 960145. This would be
the same for any student who was studying only one subject and the subject was deleted. Again
this problem arises from the need to have a compound primary key.
• The alternative approach is to split the table into two parts, one for the repeating groups
and one for the non-repeating groups.
• The primary key for the original relation is included in both of the new relations.
Record
matric_no subject grade
960100 Databases C
960100 Soft_Dev A
960100 ISDE D
960105 Soft_Dev B
960105 ISDE B
... ... ...
960150 Workshop B
Student
matric_no name date_of_birth
960100 Smith,J 14/11/1977
960105 White,A 10/05/1975
960120 Moore,T 11/03/1970
960145 Smith,J 09/01/1972
960150 Black,D 21/08/1973
Student and Record are now in First Normal Form.
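As a sketch, the two relations might be declared in SQL like this (the column types are assumptions):

CREATE TABLE student (
    matric_no CHAR(6) PRIMARY KEY,
    name VARCHAR(50),
    date_of_birth DATE
);
CREATE TABLE record (
    matric_no CHAR(6) REFERENCES student(matric_no),
    subject VARCHAR(30),
    grade CHAR(1),
    PRIMARY KEY (matric_no, subject)
);

Note the compound primary key on record, and the foreign key linking it back to student.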
4.5.2 Second Normal Form
Second normal form (or 2NF) is a more stringent normal form defined as:
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully functionally
dependent on the whole key.
Thus the relation is in 1NF with no repeating groups, and all non-key attributes must depend on
the whole key, not just some part of it. Another way of saying this is that there must be no partial
key dependencies (PKDs).
The problems arise when there is a compound key, e.g. the key to the Record relation -
matric_no, subject. In this case it is possible for non-key attributes to depend on only part of the
key - i.e. on only one of the two key attributes. This is what 2NF tries to prevent.
Consider again the Student relation from the flattened Student #2 table:
Student(matric_no, name, date_of_birth, subject, grade )
A dependency diagram is used to show how non-key attributes relate to each part or combination
of parts in the primary key.
After this decomposition, all attributes in each relation are fully
functionally dependent upon its primary key.
Third Normal Form
• A relation is in 3NF if, and only if, it is in 2NF and there are no transitive functional
dependencies
• Transitive functional dependencies arise:
• when one non-key attribute is functionally dependent on another non-key attribute:
• FD: non-key attribute -> non-key attribute
• and when there is redundancy in the database
By definition transitive functional dependency can only occur if there is more than one non-key
field, so we can say that a relation in 2NF with zero or one non-key field must automatically be
in 3NF.
Consider the relation Project(project_no, manager, address), with the dependencies
project_no -> manager and manager -> address. The second is a transitive dependency: the
determinant and the attributes to the right of the “->” are not in the key, with at least one of them
actually being in the relation.
• Data redundancy arises from this
• we duplicate address if a manager is in charge of more than one project
• causes problems if we have to change the address - we have to change several entries, and this
could lead to errors.
• The solution is to eliminate transitive functional dependency by splitting the table
• create two relations - one with the transitive dependency in it, and another for all of the
remaining attributes.
• split Project into Project and Manager.
• the determinant attribute becomes the primary key in the new relation
• manager becomes the primary key to the Manager relation
• the original key is the primary key to the remaining non-transitive attributes
• in this case, project_no remains the key to the new Projects table.
Project
project_no manager
p1 Black,B
p2 Smith,J
p3 Black,B
p4 Black,B
Manager
manager address
Black,B 32 High Street
Smith,J 11 New Street
Summary: 1NF - a relation is in 1NF if it contains no repeating attributes or groups of attributes.
Summary: 2NF - a relation is in 2NF if it is in 1NF and every non-key attribute is fully
functionally dependent on the whole key (no partial key dependencies).
Summary: 3NF - a relation is in 3NF if it is in 2NF and there are no transitive functional
dependencies.
Boyce-Codd Normal Form (BCNF) - Overview
• When a relation has more than one candidate key, anomalies may result even though the
relation is in 3NF.
• 3NF does not deal satisfactorily with the case of a relation with overlapping candidate
keys
• i.e. composite candidate keys with at least one attribute in common.
• BCNF is based on the concept of a determinant.
• A determinant is any attribute (simple or composite) on which some other attribute is
fully functionally dependent.
• A relation is in BCNF if, and only if, every determinant is a candidate key.
Consider, for example, a relation R(a,b,c,d) with primary key a,b and the determinants:
a,c -> b,d
a,d -> b
Here, the first determinant suggests that the primary key of R could be changed from a,b to a,c. If
this change was done all of the non-key attributes present in R could still be determined, and
therefore this change is legal. However, the second determinant indicates that a,d determines b,
but a,d could not be the key of R, as a,d does not determine all of the non-key attributes of R (it
does not determine c). We would say that the first determinant is a candidate key, but the second
determinant is not a candidate key, and thus this relation is not in BCNF (but is in 3rd normal
form).
Normalisation to BCNF - Example 1
DB(Patno,PatName,appNo,time,doctor)
Patno -> PatName
Patno,appNo -> Time,doctor
Time -> appNo
Now we have to decide what the primary key of DB is going to be. From the information we
have, we could choose:
DB(Patno,PatName,appNo,time,doctor) with primary key Patno, appNo (example 1a)
or
DB(Patno,PatName,appNo,time,doctor) with primary key Patno, time (example 1b)
Example 1a - DB(Patno,PatName,appNo,time,doctor), primary key Patno, appNo
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
• BCNF Every determinant is a candidate key
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
• Go through all determinants where ALL of the left hand attributes are present in a
relation and at least ONE of the right hand attributes is also present in the relation.
• Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
• Patno,appNo -> Time,doctor
All LHS present, and time and doctor also present, so relevant. Is this a candidate key?
Patno,appNo IS the key, so this is a candidate key. Thus this is OK for BCNF
compliance.
• Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? If it were, then we
could rewrite DB as:
DB(Patno,appNo,time,doctor) with time as the key
This will not work, as you need both time and Patno together to form a unique key. Thus
this determinant is not a candidate key, and therefore DB is not in BCNF. We need to fix
this.
• BCNF: rewrite to
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
time is enough to work out the appointment number of a patient. Now BCNF is satisfied, and the
final relations shown are in BCNF.
Example 1b - DB(Patno,PatName,appNo,time,doctor), primary key Patno, time
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies (PatName depends on Patno alone, and appNo on time alone):
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
3NF Eliminate transitive dependencies: none, so the relations are unchanged.
• BCNF Every determinant is a candidate key
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
• Go through all determinants where ALL of the left hand attributes are present in a
relation and at least ONE of the right hand attributes is also present in the relation.
• Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
• Patno,appNo -> Time,doctor
Not all LHS present, so not relevant.
• Time -> appNo
Time is present, and so is appNo, so relevant. This is a candidate key. However, Time is
currently the key for R2, so it satisfies the rules for BCNF.
• BCNF: as 3NF
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
Summary - Example 1
This example has demonstrated three things:
• BCNF is stronger than 3NF; relations that are in 3NF are not necessarily in BCNF
• BCNF is needed in certain situations to obtain full understanding of the data model
• there are several routes to take to arrive at the same set of relations in BCNF;
unfortunately there are no rules as to which route will be the easiest one to take.
Example 2
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
• Functional dependencies (implied by the decomposition steps that follow)
StudNo -> StudName
StudNo,Major -> Advisor
Advisor -> Major
CourseNo -> Ctitle,InstrucName,InstructLocn
InstrucName -> InstructLocn
StudNo,Major,CourseNo -> Grade
• Unnormalised
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
• 1NF Remove repeating groups
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,
Ctitle,InstrucName,InstructLocn,Grade)
• 2NF Remove partial key dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName,InstructLocn)
• 3NF Remove transitive dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstructName,InstructLocn)
BCNF
Student(StudNo,StudName)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstructName,InstructLocn)
StudMajor(StudNo,Advisor)
Advisor(Advisor,Major)
Problems BCNF overcomes
STUDENT MAJOR ADVISOR
123 PHYSICS EINSTEIN
123 MUSIC MOZART
456 BIOLOGY DARWIN
789 PHYSICS BOHR
999 PHYSICS EINSTEIN
• If the record for student 456 is deleted we lose not only information on student 456 but
also the fact that DARWIN advises in BIOLOGY.
• We cannot record the fact that WATSON can advise on COMPUTING until we have a
student majoring in COMPUTING to whom we can assign WATSON as an advisor.
• Now that we have reached the end of the normalisation process, you must go back and
compare the resulting relations with the original ER model.
• You may need to alter the model to take account of the changes that have occurred during
the normalisation process. Your ER diagram should always be a perfect reflection of the
model you are going to implement in the database, so keep it up to date!
• The changes required depend on how good the ER model was at first!
Normalisation Example
Library
Consider the case of a simple video library. Each video has a title, director, and serial number.
Customers have a name, address, and membership number. Assume only one copy of each video
exists in the library. We are given:
video(title,director,serial)
customer(name,addr,memberno)
hire(memberno,serial,date)
title->director,serial
serial->title
serial->director
name,addr -> memberno
memberno -> name,addr
serial,date -> memberno
What normal form is this?
• No repeating groups, so at least 1NF
• 2NF? There is a composite key in hire. Investigate further... Can memberno in hire be
found with just serial or just date? No. Therefore the relations are in at least 2NF.
• 3NF? serial->director is a non-key dependency. Therefore the relations are currently in
2NF. To reach 3NF, split video(title,director,serial) into video(title,serial) and
serial(serial,director). Now check each relation against BCNF:
video(title,serial)
Determinants are:
title -> serial Candidate key
serial -> title Candidate key
video in BCNF
serial(serial,director)
Determinants are:
serial -> director Candidate key
serial in BCNF
customer(name,addr,memberno)
Determinants are:
name,addr -> memberno Candidate key
memberno -> name,addr Candidate key
customer in BCNF
hire(memberno,serial,date)
Determinants are:
serial,date -> memberno Candidate key
hire in BCNF
Therefore the relations are also now in BCNF.
5 Relational Algebra
In order to implement a DBMS, there must exist a set of rules which state how the database
system will behave. For instance, somewhere in the DBMS must be a set of statements which
indicate that when someone inserts data into a row of a relation, it has the effect which the user
expects. One way to specify this is to use words to write an `essay' as to how the DBMS will
operate, but words tend to be imprecise and open to interpretation. Instead, relational databases
are more usually defined using Relational Algebra.
The relational algebra is a procedural query language that consists of a set of operations that take
one or two relations as input and produce a new relation as their result. Each query describes a
step-by-step procedure for computing the desired answer. Operators may be unary (operating on
one relation) or binary (operating on two).
Relational algebra is, in effect, the formal description of how a relational database operates: the mathematics which underpins SQL operations.
Operators in relational algebra are not necessarily the same as SQL operators, even if they have
the same name. For example, the SELECT statement exists in SQL, and also exists in relational
algebra. These two uses of SELECT are not the same. The DBMS must take whatever SQL
statements the user types in and translate them into relational algebra operations before applying
them to the database.
Terminology
In relational algebra a relation corresponds to a table, a tuple to a row of that table, and an
attribute to a column.
Operators - Write
• INSERT - provides a list of attribute values for a new tuple in a relation. This operator is
the same as in SQL.
• DELETE - provides a condition on the attributes of a relation to determine which tuple(s)
to remove from the relation. This operator is the same as in SQL.
• MODIFY - changes the values of one or more attributes in one or more tuples of a
relation, as identified by a condition operating on the attributes of the relation. This is
equivalent to SQL UPDATE.
Operators - Retrieval
There are two groups of operations: those derived from mathematical set theory (UNION,
INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT), and those developed
specifically for relational databases (SELECT, PROJECT, and JOIN).
Relational SELECT
SELECT is used to obtain a subset of the tuples of a relation that satisfy a select condition.
For example, find all employees born after 1st Jan 1950:
SELECTdob > '01/JAN/1950'(employee)
Relational PROJECT
The PROJECT operation is used to select a subset of the attributes of a relation by specifying the
names of the required attributes.
For example, to get a list of all employees surnames and employee numbers:
PROJECTsurname,empno(employee)
• UNION of R and S
the union of two relations is a relation that includes all the tuples that are either in R or in
S or in both R and S. Duplicate tuples are eliminated.
• INTERSECTION of R and S
the intersection of R and S is a relation that includes all tuples that are both in R and S.
• DIFFERENCE of R and S
the difference of R and S is the relation that contains all the tuples that are in R but that
are not in S.
UNION Example
Figure : UNION
INTERSECTION Example
Figure : Intersection
DIFFERENCE Example
Figure : DIFFERENCE
CARTESIAN PRODUCT
The Cartesian Product is also an operator which works on two sets. It is sometimes called the
CROSS PRODUCT or CROSS JOIN.
It combines the tuples of one relation with all the tuples of the other relation.
• In its simplest form the JOIN operator is just the cross product of the two relations.
• As the join becomes more complex, tuples are removed within the cross product to make
the result of the join more meaningful.
• JOIN allows you to evaluate a join condition between the attributes of the relations on
which the join is undertaken.
JOIN Example
Figure : JOIN
Natural Join
Invariably the JOIN involves an equality test, and thus is often described as an equi-join. Such
joins result in two attributes in the resulting relation having exactly the same value. A `natural
join' will remove the duplicate attribute(s).
• In most systems a natural join will require that the attributes have the same name to
identify the attribute(s) to be used in the join. This may require a renaming mechanism.
• If you do use natural joins make sure that the relations do not have two attributes with the
same name by accident.
OUTER JOINs
Notice that much of the data is lost when applying a join to two relations. In some cases this lost
data might hold useful information. An outer join retains the information that would have been
lost from the tables, replacing missing data with nulls.
There are three forms of the outer join, depending on which data is to be kept.
Figure : OUTER JOIN (left/right)
6 Concurrency using Transactions
The goal in a `concurrent' DBMS is to allow multiple users to access the database simultaneously
without interfering with each other.
A problem with multiple users using the DBMS is that it may be possible for two users to try and
change data in the database simultaneously. If this type of action is not carefully controlled,
inconsistencies are possible.
To control data access, we first need a concept to allow us to encapsulate database accesses.
Such encapsulation is called a `Transaction'.
6.1 Transactions
Introduction
We can classify a database system according to the number of users who can use the system
concurrently.
• Single user
• Multi-user
Multiple users can access the database and use the computer simultaneously because of the
concept of multiprogramming, which allows a computer to execute many programs or processes
concurrently. Concurrent execution of processes is actually interleaved. A transaction is a
logical unit of work on the database, with:
o Beginning.
o Sequence of read and write operations.
o Ending - either committed or aborted.
After work is performed in a transaction, two outcomes are possible:
• Commit - Any changes made during the transaction by this transaction are committed to
the database.
• Abort - All the changes made during the transaction by this transaction are not made to
the database. The result of this is as if the transaction was never started.
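In SQL these two outcomes correspond to COMMIT and ROLLBACK. A minimal sketch (the account table and its columns are illustrative, not taken from these notes):

START TRANSACTION;
UPDATE account SET balance = balance + 10 WHERE accno = 'X';
COMMIT;

Replacing the final COMMIT with ROLLBACK would abort the transaction, leaving the balance as if the update had never happened.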
A transaction schedule is a tabular representation of how a set of transactions were executed over
time. This is useful when examining problem scenarios. Within the diagrams various
nomenclatures are used, as the examples below illustrate.
Consider transaction A, which loads in a bank account balance X (initially 20) and adds 10
pounds to it. Such a schedule would look like this:
Time Transaction A
t1 TOTAL:=READ(X)
t2 TOTAL:=TOTAL+10
t3 WRITE(TOTAL,X)
Now consider that, at the same time as transaction A runs, transaction B runs. Transaction B
gives all accounts a 10% increase. Will X be 32 or 33?
Time Transaction A Transaction B TOTAL BALANCE
t1 - BALANCE:=READ(X) - 20
t2 TOTAL:=READ(X) - 20 -
t3 TOTAL:=TOTAL+10 - 30 -
t4 WRITE(TOTAL,X) - 30 -
t5 - BALANCE:=BALANCE*110% - 22
t6 - WRITE(BALANCE,X) - 22
Whoops... X is 22! Transaction A's update has been lost (the `lost update' scenario). Depending
on the interleaving, X can also be 32, 33, or 30. Let's classify erroneous scenarios.
Inconsistency
In this scenario one transaction sums the balances of accounts X, Y, and Z while another
transfers 10 from Z to X:
Time X Y Z Action SUM
t1 40 50 30 SUM:=READ(X) 40
t2 40 50 30 SUM+=READ(Y) 90
t3 40 50 30 ACC1:=READ(Z)
t4 40 50 20 WRITE(ACC1-10,Z)
t5 40 50 20 ACC2:=READ(X)
t6 50 50 20 WRITE(ACC2+10,X)
t7 50 50 20 COMMIT
t8 50 50 20 SUM+=READ(Z) 110
SUM should have been 120...
6.3.1 Serialisability
Precedence Graph
In order to know whether a particular transaction schedule can be serialised, we can draw a
precedence graph. This is a directed graph, where the nodes are the transaction names and the
edges are attribute collisions.
The schedule is serialisable if, and only if, there are no cycles in the resulting diagram.
Precedence Graph : Method
To draw one:
1. Draw a node for each transaction in the schedule.
2. For each pair of operations on the same attribute by different transactions, where at least one
of the operations is a WRITE, draw an edge from the transaction which acted on the attribute
first to the transaction which acted on it second.
3. If the resulting graph has no cycles, the schedule is serialisable.
Example 1
Consider the following schedule:
Time TRAN1 TRAN2
t1 READ(A)
t2 READ(B)
t3 READ(A)
t4 READ(B)
t5 WRITE(x,B)
t6 WRITE(y,B)
TRAN1 reads B before TRAN2 writes it (an edge TRAN1 to TRAN2), and TRAN2 reads B
before TRAN1 writes it (an edge TRAN2 to TRAN1). The graph contains a cycle, so this
schedule is not serialisable.
6.3.2 Locking
Many systems use locking mechanisms for concurrency control. When a transaction needs an
assurance that some object will not change in some unpredictable manner, it acquires a lock on
that object.
• A transaction holding a read lock is permitted to read an object but not to change it.
• More than one transaction can hold a read lock for the same object.
• Usually, only one transaction may hold a write lock on an object.
• On a transaction schedule, we use `S' to indicate a shared lock, and `X' for an exclusive
write lock.
• A DBMS must be able to ensure that only serializable, recoverable schedules are
allowed.
• No actions of committed transactions are lost while undoing aborted transactions.
• A DBMS uses a locking protocol to achieve this.
• A locking protocol is a set of rules to be followed by each transaction (and enforced by
the DBMS), in order to ensure that even though actions of several transactions might be
interleaved, the net effect is identical to executing all transactions in some serial order.
6.3.3 Deadlock
Deadlock can arise when locks are used, and causes all related transactions to WAIT forever...
Time Transaction A Transaction B Lock on X Lock on Y
t1 WRITE(p,X) - X -
t2 - WRITE(q,Y) X X
t3 READ(Y) (WAIT) - X X
t4 ...WAIT... READ(X) (WAIT) X X
t5 ...WAIT... ...WAIT... X X
The `lost update' scenario results in deadlock with locks. So does the `inconsistency' scenario.
• Deadlock avoidance
o pre-claim strategy used in operating systems
o not effective in database environments.
o Difficult to be omniscient.
o Reduces concurrency.
o Some transactions might never start - livelock (starvation).
• Preempt transactions that might lead to a deadlock. When lock conflicts arise:
• Wait-die: only older transactions wait for younger transactions to finish. Young
transactions waiting on older transactions are killed and later restarted.
• Wound-wait: young transactions wait for old transactions to finish. Young transactions
holding locks that an older transaction is waiting on are killed (wounded) and later restarted.
• Using these rules, some transaction will always be making progress.
• No deadlock possible since we kill any transaction that might cause deadlock.
• Deadlock detection
o Check whenever a lock request leads to a wait, or on some periodic basis.
o If a transaction is blocked due to another transaction, make sure that that
transaction is not blocked on the first transaction, either directly or indirectly via
another transaction.
o Periodically* check for deadlock - if it has occurred, break it.
o Construct a wait-for graph. Nodes are transactions, with an edge from Ti to Tj if Ti is
waiting on a lock held by Tj.
o Any cycle in the graph indicates a deadlock.
o *Could trigger inspection by a `watchdog' protocol.
Deadlock Resolution
If a set of transactions is considered to be deadlocked, one of them is chosen as a victim and
rolled back (and usually restarted later), releasing its locks so that the other transactions can
proceed.
Two-Phase Locking
One locking protocol which guarantees serialisability is two-phase locking:
• Before operating on any item, a transaction must acquire at least a shared lock on that
item. Thus no item can be accessed without first obtaining the correct lock.
• After releasing a lock, a transaction must never go on to acquire any more locks.
The technical names for the two phases of the locking protocol are the `lock-acquisition phase'
and the `lock-release phase'.
6.3.6 Other Database Consistency Methods
Two-phase locking is not the only approach to enforcing database consistency. Another method
used in some DMBS is timestamping. With timestamping, there are no locks to prevent
transactions seeing uncommitted changes, and all physical updates are deferred to commit time.
• locking synchronises the interleaved execution of a set of transactions in such a way that
it is equivalent to some serial execution of those transactions.
• timestamping synchronises that interleaved execution in such a way that it is equivalent
to a particular serial order - the order of the timestamps.
Timestamping rules
The following rules are checked when transaction T attempts to change a data item. If the rule
indicates ABORT, then transaction T is rolled back and aborted (and perhaps restarted).
• If T attempts to read a data item which has already been written to by a younger
transaction then ABORT T.
• If T attempts to write a data item which has been seen or written to by a younger
transaction then ABORT T.
If transaction T aborts, then all other transactions which have seen a data item written to by T
must also abort. In addition, other aborting transactions can cause further aborts on other
transactions. This is a `cascading rollback'.
6.4 Recovery
If the database is in an inconsistent state, it is necessary to recover to a consistent state. The basis
of recovery is to have backups of the data in the database.
• The recovery manager of a DBMS is responsible for ensuring transaction atomicity and
durability.
• Atomicity: by undoing the actions of transactions that do not commit.
• Durability: by making sure that all actions of committed transactions survive system
crash and media failures (e.g. a disk is corrupted)
• When a DBMS is restarted after a crash, the recovery manager is given control and must bring
the database to a consistent state.
• The recovery manager is also responsible for undoing the actions of an aborted transaction.
• The transaction manager of a DBMS controls the execution of transactions.
• Atomic Writes: writing a page to disk in an atomic action.
• Whenever a transaction is submitted to a DBMS for execution, the system is responsible for
making sure that either:
• All the operations in the transaction are completed successfully and their effect is recorded
permanently in the database, or
• The transaction has no effect whatsoever on the database or on any other transaction.
• The second outcome may not be achieved if the transaction fails after executing some of its
operations but before executing all of them.
Types of Failures
• A computer failure (System crash): Hardware, software or network error occurs during
transaction execution.
• A transaction or system error: Some operations in a transaction may cause it to fail e.g.
integer overflow or division by zero.
• Local errors or exception conditions detected by the transaction: conditions that occur during
execution that may require cancellation of the transaction, e.g. an insufficient account balance
may cause a transaction such as a withdrawal to be cancelled.
• Concurrency control enforcement: The CC method may decide to abort the transaction, to be
restarted later, because it violates serializability.
• Disk failure: due to a read/write malfunction or a read/write head crash. May happen during a
read/write operation.
• Physical problems and catastrophes: include power or air-conditioning failure, fire, theft,
sabotage, overwriting disks, etc.
Recovery from a transaction failure means that the database is restored to the most recent
consistent state just before the time of failure. The system must keep information about the
changes that were applied to data items by the various transactions. This information is typically
stored in a system log.
• If there is extensive damage, e.g. a catastrophic failure, the recovery method restores a past
copy of the database that was backed up to archival storage (typically tape) and reconstructs a
more current state by reapplying or redoing the operations of committed transactions
from the backed-up log.
• When the database is not physically damaged, the strategy is to reverse any changes that
caused the inconsistency by undoing some operations.
Deferred update: Do not physically update the database on disk until after a transaction
reaches its commit point, then the updates are recorded in the database
Immediate Update: The database may be updated by some operations before the
operation reaches its commit point.
Transaction logging:
• The DBMS records information about the progress of transactions in a log since the last
consistent state.
• the database therefore knows the state of the database before and after each transaction.
• every so often the database is returned to a consistent state and the log may be truncated to
remove committed transactions.
• when the database is returned to a consistent state the process is often referred to as
`checkpointing'.
Under deferred update, if the DBMS fails and is restarted:
• If the disks are physically or logically damaged then recovery from the log is impossible
and instead a restore from a dump is needed.
• If the disks are OK then the database consistency must be maintained. Writes to the disk
which were in progress at the time of the failure may have only been partially done.
• Parse the log file, and where a transaction has been ended with `COMMIT' apply the data
part of the log to the database.
• If a log entry for a transaction ends with anything other than COMMIT, do nothing for
that transaction.
• flush the data to the disk, and then truncate the log to zero.
• the process of reapplying transactions from the log is sometimes referred to as
`rollforward'.
6.4.3 Immediate Update
• While a transaction runs, changes made by that transaction can be written to the database
at any time. However, the original and the new data being written must both be stored in
the log BEFORE storing it on the database disk.
• On a commit:
• All the updates which have not yet been recorded on the disk are first stored in the log file
and then flushed to disk.
• The new data is then recorded in the database itself.
• On an abort, UNDO all the changes which that transaction has made to the database disk,
using the `old data' entries in the log.
• On a system restart after a failure, REDO the committed changes from the log and UNDO
the uncommitted ones.
Example
Using immediate update, and the transaction TRAN1 again, the process is:
Time Action LOG
t1 START -
t2 read(A) -
t3 write(10,B) Was B == 6, now 10
t4 write(20,C) Was C == 2, now 20
t5 COMMIT COMMIT
If the DMBS fails and is restarted:
• If the disks are physically or logically damaged then recovery from the log is impossible
and instead a restore from a dump is needed.
• If the disks are OK then the database consistency must be maintained. Writes to the disk
which were in progress at the time of the failure may have only been partially done.
• Parse the log file, and where a transaction has been ended with `COMMIT' apply the
`new data' part of the log to the database.
• If a log entry for a transaction ends with anything other than COMMIT, apply the `old
data' part of the log to the database.
• flush the data to the disk, and then truncate the log to zero.
6.4.4 Rollback
The process of undoing changes done to the disk under immediate update is frequently referred
to as rollback.
• Where the DBMS does not prevent one transaction from reading uncommitted
modifications (a `dirty read') of another transaction (i.e. the uncommitted dependency
problem) then aborting the first transaction also means aborting all the transactions which
have performed these dirty reads.
• as a transaction is aborted, it can therefore cause aborts in other dirty reader transactions,
which in turn can cause further aborts in other dirty reader transactions. This is referred to
as `cascade rollback'.
7 DBMS Implementation
7.1 Implementing a DBMS
A database management system handles the requests generated from the SQL interface,
producing or modifying data in response to these requests. This involves a multilevel processing
system.
• Parser: The SQL must be parsed and tokenised. Syntax errors are reported back to the
user. Parsing can be time consuming, so good quality DBMS implementations cache
queries after they have been parsed so that if the same query is submitted again the
cached copy can be used instead. To make the best use of this most systems use
placeholders in queries, like:
• SELECT empno FROM employee where surname = ?
The '?' character is prompted for when the query is executed, and can be supplied
separately from the query by the API used to inject the SQL. The parameter is not part of
the parsing process, and thus once this query is parsed it need not be parsed again.
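One concrete realisation of this is MySQL's prepared-statement syntax (shown here purely as an illustration; the surname value is arbitrary):

PREPARE find_emp FROM 'SELECT empno FROM employee WHERE surname = ?';
SET @surname = 'Smith';
EXECUTE find_emp USING @surname;
DEALLOCATE PREPARE find_emp;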
• Executer: This takes the SQL tokens and basically translates them into relational algebra.
Each relational algebra fragment is optimised, and then passed down the levels to be acted
on.
• User: The concept of the user is required at this stage. This gives the query context, and
also allows security to be implemented on a per-user basis.
• Transactions: The queries are executed in the transaction model. The same query from
the same user can be executing multiple times in different transactions. Each transaction
is quite separate.
• Tables: The idea of the table structure is controlled at a low level. Much security is based
on the concept of tables, and the schema itself is stored in tables, as well as being a set of
tables itself.
• Table cache: Disks are slow, yet a disk is the best way of storing long-term data. Memory
is much faster, so it makes sense to keep as much table information as possible in
memory. The disk remains synchronised to memory as part of the transaction control
system.
• Disks : Underlying almost all database systems is the disk storage system. This provides
storage for the DBMS system tables, user information, schema definitions, and the user
data itself. It also provides the means for transaction logging.
The 'user' context is handled in a number of different ways, depending on the database system
being used. The following diagram gives you an idea of the approach followed by two different
systems, Oracle and MySQL.
In both approaches, tables in other tablespaces can be accessed. MySQL effectively sees a
tablespace and a database being the same concept, but in Oracle the two ideas are kept slightly
more separate. However, the syntax remains the same. Just as you can access column owner of
table CAR, if it is in your own tablespace, by saying
SELECT car.owner FROM car;
You can access table CAR in another tablespace (let's call it vehicles) by saying:
SELECT vehicles.car.owner FROM vehicles.car;
The appearance of this structure is similar in concept to the idea of file directories. In a database
the directories are limited to "folder.table.column", although "folder" could be a username, a
tablename, or a database, depending on the philosophy of the database management system.
Even then, the concept is largely similar.
7.1.1 Disk and Memory
The tradeoff between the DBMS using Disk or using main memory should be understood...
Issue Main Memory vs Disk
Speed Main memory is at least 1000 times faster than disk.
Storage Space Disk can hold hundreds of times more information than memory for the same cost.
Persistence When the power is switched off, disk keeps the data; main memory forgets everything.
Access Time Main memory starts sending data in nanoseconds, while disk takes milliseconds.
Block Size Main memory can be accessed 1 word at a time, disk 1 block at a time.
The DBMS runs in main memory, and the processor can only access data which is currently in
main memory. The handling of the differences between disk and main memory effectively is at
the heart of a good quality DBMS.
7.2 Disk Arrangements
Efficient processing of the DBMS requests requires efficient handling of disk storage. The
important aspects of this include:
• Index handling
• Transaction Log management
• Parallel Disk Requests
• Data prediction
With indexing, we are concerned with finding the data we actually want quickly and efficiently,
without having to request and read more disk blocks than absolutely necessary. There are many
approaches to this, but two of the more important ones are hash tables and binary trees.
When handling transaction logs, the discussion we have had so far has been on the theory of
these techniques. In practice, the separation of data and log is much more blurred. We will look
at one technique for implementing transaction logging, known as shadow paging.
Finally, the underlying desire of a good DBMS is to never be in a position where no further work
can be done until the disk gives us some data. Instead, by using prediction, prefetching, and
parallel disk operations, it is hoped that CPU time becomes the limiting factor.
7.2.1 Hash tables
A Hash table is one of the simplest index structures which a database can implement. The major
components of a hash index are the "hash function" and the "buckets". Effectively the DBMS
constructs an index for every table you create that has a PRIMARY KEY attribute, like:
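-- an illustrative table; the hash index would be built on id
CREATE TABLE test (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100)
);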
The algorithm splits the places which the rows are to be stored into areas. These areas are called
buckets. If a row's primary key matches the requirements to be stored in that bucket, then that is
where it will be stored. The algorithm to decide which bucket to use is called the hash function.
For our example we will have a nice simple hash function, where the bucket number equals the
primary key. When the index is created we have to also decide how many buckets there are. In
this example we have decided on 4.
Now we can find id 3 quickly and easily by visiting bucket 3 and looking into it. But now the
buckets are full. To add more values we will have to reuse buckets. We need a better hash
function based on mod 4. Now bucket 1 holds ids (1,5,9...), bucket 2 holds (2,6,10...), etc.
Figure : Hash Table with collisions
We have had to put more than 1 row in some of the buckets. This is called a hash collision. The
more collisions we have the longer the collision chain and the slower the system will get. For
instance, finding id 6 means visiting bucket 2, and then finding id 2, then 10, and then finally 6.
In DBMS systems we can usually ask for a hash index for a table, and also say how many
buckets we think we will need. This approach is good when we know how many rows there are
likely to be. Most systems will handle the hash table for you, adding more buckets over time if
you have made a mistake. It remains a popular indexing technique.
Data Warehousing and Data Mining
Definitions:
• Data warehousing: consolidating data from many sources in one large repository.
– Loading, periodic synchronization of replicas.
• OLAP (online analytical processing): analysing complex data from the data warehouse.
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and a multidimensional view of data.
– Uses distributed computing capabilities for analyses that require more storage and processing
power.
• Data Mining: exploratory search for interesting trends and anomalies.
• Decision Support Systems (DSS) / Executive Information Systems (EIS): support an
organization’s leading decision makers with high-level data for complex and important
decisions.
• OLTP (online transaction processing): supported by traditional databases; includes
insertions, updates, and deletions, while also supporting information query requirements.
Data warehouses:
• Contain consolidated data from many sources, augmented with summary information and
covering a long time period.
• Are clearly distinct from traditional transactional databases (relational, object-oriented,
network or hierarchical).
• Are mainly intended for decision support applications and are optimized for data retrieval,
not routine transaction processing.
Orders of magnitude
• Enterprise-wide data warehouses: huge projects requiring massive investment of time and
resources.
• Virtual data warehouses: provide views of operational databases that are materialized for
efficient access.
• Data marts: generally targeted at a subset of the organization, such as a department, and more
tightly focused.
• Data mining is related to exploratory data analysis in statistics, and to knowledge discovery
and machine learning in artificial intelligence.
• Knowledge discovery in databases (KDD) encompasses more than data mining.
– It includes data selection, data cleansing, enrichment, data transformation or encoding,
data mining, and display of the discovered information.
Goals of data mining:
• Prediction: showing how some attributes within the data will behave in the future.
• Identification: data patterns can be used to identify the existence of an item, an event or an
activity.
• Classification: different classes or categories can be identified based on combinations of
parameters.
• Optimization: optimizing the use of limited resources such as time, space and money to
maximize output variables.
B+ Trees
The B+ Tree is the index structure most commonly used in practice:
• the lowest level in the index has one entry for each data record.
• the index is created dynamically as data is added to the file.
• as data is added the index is expanded such that each record requires the same number of
index levels to reach it (thus the tree stays `balanced').
• the records can be accessed via an index or sequentially.
Each index node in a B+ Tree can hold a certain number of keys. The number of keys is often
referred to as the `order'. Unfortunately, `Order 2' and `Order 1' are frequently confused in the
database literature. For the purposes of our coursework and exam, `Order 2' means that there can
be at most 2 keys in each index node.
• The top level of an index is usually held in memory. It is read once from disk at the start
of queries.
• Each index entry points to either another level of the index, a data record, or a block of
data records.
• The top level of the index is searched to find the range within which the desired record
lies.
• The appropriate part of the next level is read into memory from disc and searched.
• This continues until the required data is found.
• The use of indices reduces the amount of file which has to be searched.
• The major cost of accessing an index is associated with reading in each of the
intermediate levels of the index from a disk (milliseconds).
• Searching the index once it is in memory is comparatively inexpensive (microseconds).
• The major cost of accessing data records involves waiting for the media to recover the
required blocks (milliseconds).
• Some indexes mix the index blocks with the data blocks, which means that disk accesses
can be saved because the final level of the index is read into memory with the associated
data records.
• A DBMS may use different file organisations for its own purposes.
• A DBMS user is generally given little choice of file type.
• A B+ Tree is likely to be used wherever an index is needed.
• Indexes are generated (a short example follows below):
o (probably) for fields specified with `PRIMARY KEY' or `UNIQUE' constraints in a CREATE TABLE statement;
o for fields specified in SQL statements such as CREATE [UNIQUE] INDEX indexname ON tablename (col [,col]...);
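For instance, using Python's built-in sqlite3 module (the staff table and column names are made up for this example; SQLite builds an index behind a PRIMARY KEY constraint automatically, and CREATE INDEX adds a secondary index):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # SQLite creates an automatic index to enforce the PRIMARY KEY.
    con.execute("CREATE TABLE staff (staff_id TEXT PRIMARY KEY, surname TEXT)")
    # An explicit secondary index, which may contain duplicates, on surname.
    con.execute("CREATE INDEX staff_surname_idx ON staff (surname)")
    # List the indexes SQLite now holds for the table.
    for row in con.execute("PRAGMA index_list('staff')"):
        print(row)   # shows staff_surname_idx and the automatic key index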
• Primary indexes have unique keys.
• Secondary indexes may have duplicates.
• An index on a column which is used in an SQL `WHERE' predicate is likely to speed up an enquiry.
• This is particularly so when `=' is involved (as in an equijoin).
• No improvement will occur with `IS [NOT] NULL' conditions.
• An index is best used on a column with widely varying data.
• Indexing a column of Y/N values might actually slow down enquiries.
• An index on telephone numbers might be very good, but an index on area codes might be a poor performer (see the sketch after this list).
• Multicolumn indexes can be used; the column with the biggest range of values, or the most frequently accessed, should be listed first.
• Avoid indexing small relations, frequently updated columns, or columns holding long strings.
• There may be several indexes on each table, although partial indexing normally supports only one index per table.
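One way to check whether a `WHERE' predicate really uses an index is to ask the query planner; for example, SQLite's EXPLAIN QUERY PLAN (again via Python's sqlite3 module, with a made-up phones table):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE phones (phone TEXT, area_code TEXT)")
    con.execute("CREATE INDEX phones_phone_idx ON phones (phone)")

    # An '=' predicate on the widely varying phone column can use the index.
    for row in con.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM phones WHERE phone = '555-0100'"):
        print(row)   # the plan mentions phones_phone_idx

    # A predicate on the unindexed, low-variation area_code column cannot.
    for row in con.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM phones WHERE area_code = '020'"):
        print(row)   # the plan shows a scan of the whole table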
• Reading or updating a particular record should be fast.
• Inserting records should be reasonably fast. However, each index has to be updated too, so adding indexes makes insertion slower.
• Deletion may be slow, particularly when the indexes have to be updated, although it may be fast if records are simply flagged as `deleted'.
Clustering
Clustering stores records that are used together close to each other on disk, so that the DBMS can read a small part of the disk and have all the data it needs to complete the query. Without clustering, the disk may have to move over the whole disk surface looking for bits of the query data, and this could be hundreds of times slower than being able to get it all at once. Most DBMS systems perform clustering techniques, either user-directed or automatically.
Shadow Paging
With shadow paging, the transaction log does not hold the individual attributes being changed but a copy of the whole disk block holding the data being changed. This sounds expensive, but is actually highly efficient. Once a transaction begins, any change to disk follows this procedure (a sketch follows below):
1. If the disk block to be changed has already been copied to the log, skip to step 3.
2. Copy the disk block to the transaction log.
3. Write the change to the original disk block.
On a commit, the copies of the disk blocks in the log can be erased. On an abort, all the blocks in the log are copied back to their old locations. As disk access is based on disk blocks, this process is fast and simple. Most DBMS systems use a transaction mechanism based on shadow paging.
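A minimal sketch of this block-copy mechanism, with the disk and the transaction log modelled as Python dictionaries; this illustrates the procedure above rather than any real DBMS's implementation:

    # block id -> block contents
    disk = {"B1": "old data 1", "B2": "old data 2"}
    log = {}

    def write_block(block_id, new_contents):
        # Steps 1-2: copy the original block to the log, once per transaction.
        if block_id not in log:
            log[block_id] = disk[block_id]
        # Step 3: write the change to the original disk block.
        disk[block_id] = new_contents

    def commit():
        log.clear()          # the saved copies are no longer needed

    def abort():
        disk.update(log)     # copy every saved block back to its old location
        log.clear()

    write_block("B1", "new data 1")
    abort()
    print(disk["B1"])        # -> old data 1: the change has been undone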
8 DATABASE SECURITY
Database security refers to the techniques used to protect the database against unauthorised access, whether total or partial, against malicious destruction or alteration, and against the accidental introduction of inconsistency (the protection that integrity constraints provide).
Types of Security
Database security addresses the following issues:
• Legal and ethical issues regarding the right to access certain information.
• Policy issues at the governmental, institutional or corporate levels as to what kinds of information should not be made publicly available.
• System-related issues, e.g. whether security should be handled at the physical or operating system level.
• The need in some organizations to identify multiple security levels and to categorise data and users according to these levels, e.g. top secret, secret, confidential, unclassified.
• Audit trails:
The DB system must also keep track of all operations on the database that are applied by a given user throughout each login session.
The system log includes an entry for each operation applied to the database, which may be required for recovery from a transaction failure or system crash.
A database audit consists of reviewing the log to examine all the accesses and operations performed on the database during a certain time period.
An audit trail is a log of all changes (inserts, deletes, updates) to the database, along with information such as which user performed each change and when it was performed.
The audit trail aids security in several ways: for example, if the balance of an account is found to be incorrect, the bank can trace all the updates on the account to find the incorrect (fraudulent) update. A sketch of such a trail follows below.
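This kind of audit trail can be sketched with a database trigger that records every balance update; the example uses SQLite through Python's sqlite3 module, with made-up table names (a real DBMS would also record which user made the change):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE account (acc_no INTEGER PRIMARY KEY, balance REAL);
        CREATE TABLE audit_trail (acc_no INTEGER, old_balance REAL,
                                  new_balance REAL, changed_at TEXT);
        -- Log every update to an account, along with when it happened.
        CREATE TRIGGER account_audit AFTER UPDATE ON account
        BEGIN
            INSERT INTO audit_trail
            VALUES (OLD.acc_no, OLD.balance, NEW.balance, datetime('now'));
        END;
    """)
    con.execute("INSERT INTO account VALUES (1, 100.0)")
    con.execute("UPDATE account SET balance = 90.0 WHERE acc_no = 1")
    # A database audit then reviews the trail to trace all updates.
    print(con.execute("SELECT * FROM audit_trail").fetchall())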
Encryption
A DBMS can use encryption to protect information in situations where the normal security mechanisms of the DBMS are not adequate, e.g. when an intruder taps a communication line.
The basic idea is to apply an encryption algorithm (which may be accessible to intruders) to the original data together with a user-specified or DBA-specified encryption key, which is kept secret.
A matching decryption algorithm takes the encrypted data and the encryption key as input and returns the original data.
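A small sketch of this idea using the third-party Python cryptography package, whose Fernet class bundles matching encryption and decryption routines under one secret key (the package is an assumption here and must be installed separately):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # the secret key, known only to authorised users
    cipher = Fernet(key)

    original = b"balance=5000"
    encrypted = cipher.encrypt(original)   # stored or transmitted in this form
    # Only a holder of the secret key can recover the original data.
    print(cipher.decrypt(encrypted) == original)   # -> True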
Authentication
This is the task of verifying the identity of a person or piece of software connecting to a database.
The simplest form is a secret password, which must be presented when connecting to the database. This has drawbacks, especially over a network, where an eavesdropper may be able to sniff the data being sent across the network and so obtain the username and password.
A more secure scheme involves a challenge-response system: the DB system sends a challenge string, the user encrypts it using the secret password as the encryption key, and returns the result. The DB system verifies the authenticity of the user by decrypting the response with the same secret password and checking the result against the original challenge string.
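The exchange can be sketched as follows, with Fernet again standing in for the shared-secret encryption (an illustration of the scheme above, not a production protocol):

    import os
    from cryptography.fernet import Fernet

    shared_key = Fernet.generate_key()   # secret shared by the user and the DB system

    # DB system: send a fresh random challenge string.
    challenge = os.urandom(16)

    # User: encrypt the challenge with the shared secret and return the result.
    response = Fernet(shared_key).encrypt(challenge)

    # DB system: decrypt with the same secret and compare with the challenge.
    authenticated = Fernet(shared_key).decrypt(response) == challenge
    print(authenticated)                 # -> True for the genuine user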