Database Systems
Information
Data:
Raw facts; data constitute the building blocks of information
Unprocessed information.
Raw data must be properly formatted for storage, processing and presentation.
Information:
- Result of processing raw data to reveal its meaning.
Accurate, relevant, and timely information is key to good decision making
Good decision making is the key to organizational survival in a global environment.
Data Management: Discipline that focuses on the proper generation, storage and retrieval of data.
Data management is a core activity for any business, government agency, service organization or charity.
Why store the data as raw facts?
Historical Roots:
Files and File Systems
Although file systems as a way of managing data are now largely obsolete, there are several good
reasons for studying them in detail:
- understanding file system characteristics makes database design easier to understand
- awareness of the problems with file systems helps prevent similar problems in a DBMS
- Knowledge of file systems is helpful if you plan to convert an obsolete file system to a
DBMS.
In the recent past, a manager of almost any small organization was (and sometimes still is) able to keep track of
the necessary data by using a manual file system. Such a file system was traditionally composed of a
collection of file folders, each properly tagged and kept in a filing cabinet. Unfortunately, report
generation from a manual file system can be slow and cumbersome.
File and File Systems:
Data: Raw facts, e.g. a telephone number, birth date, customer name and year-to-date (YTD) sales
value. Data have little meaning unless they have been organized in some logical manner. The smallest
piece of data that can be recognized by the computer is a single character, such as the letter A, the
number 5 or a symbol such as /. A single character requires 1 byte of computer storage.
Field: A character or group of characters (alphabetic or numeric) that has a specific meaning. A
field is used to define and store data.
Record: A logically connected set of one or more fields that describe a person, place or thing, e.g.
the fields that constitute a customer record: name, address, phone number, date of birth.
File: a collection of related records, e.g. a file might contain data about the vendors of ROBCOR
Company, or a file might contain records for the students currently enrolled at UEL.
Historical Roots: Files and File Systems
A simple file system where each department has multiple programs to directly access the data
Note: No separation as will be seen in a DBMS
As the number of files increased, a small file system evolved. Each file in the system used its own
application programs to store, retrieve and modify data, and each file was owned by the individual
or the department that commissioned its creation.
As the file system grew, the demand for the data processing (DP) specialist's programming skills grew
even faster, and the DP specialist was authorized to hire additional programmers. The size of the file
system also required a larger, more complex computer. The new computer and the additional
programming staff caused the DP specialist to spend less time programming and more time
managing technical and human resources. Therefore the DP specialist's job evolved into that of a
Data Processing (DP) Manager who supervised a DP department.
File-based System: A collection of application programs that perform services for the end-users
such as the production of reports. Each program defines and manages its own data.
Problems with File System Data Mgt.
- Data Redundancy: multiple file locations/copies could lead to Update, Insert and Delete
anomalies.
- Structural Dependencies/Data Dependence
Access to a file depends on its structure
Making changes to an existing file structure is difficult
File structure changes require modifications in all programs that use data in that file
Different programming languages have different file structures
Modifications are likely to produce errors, requiring additional time to debug the program
Programs were written in third-generation languages (3GLs): examples of 3GLs are COBOL, BASIC, FORTRAN
The programmer must specify both what the task is and how it is done.
Modern databases use fourth-generation languages (4GLs), which allow users to specify what must be done without saying how
it must be done. 4GLs are used for data retrieval (such as query-by-example and report generator
tools) and can work with different DBMSs. The need to write a 3GL program to produce even
the simplest report makes ad hoc queries impossible (see the SQL sketch after this list).
- Security features such as effective password protection, the ability to lock out parts of files or
parts of the system itself, and other measures designed to safeguard data confidentiality are difficult
to program and are therefore often omitted in a file system environment.
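As an illustration of the 4GL idea, the short SQL sketch below states only what is wanted; how the rows are located is left to the DBMS. The CUSTOMER table and its columns are hypothetical, used purely for illustration.

-- Ad hoc query: which customers in London have year-to-date sales over 1000?
-- The query states WHAT is required, not HOW to retrieve it.
SELECT CUS_NAME, CUS_YTD_SALES
FROM   CUSTOMER
WHERE  CUS_CITY = 'London'
AND    CUS_YTD_SALES > 1000;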
To Summarize the Limitations of File System Data Mgt.
Requires extensive programming.
There are no ad hoc query capabilities.
System administration can be complex and difficult.
Difficult to make changes to existing structures.
Security features are likely to be inadequate.
Limitations of File-based Approach
Separation and Isolation of data: when data is isolated in separate files, it is more difficult to
access data that should be available.
Duplication of Data: owing to the decentralised approach taken by each department, the file-based
approach encouraged, if not necessitated, the uncontrolled duplication of data.
Duplication is wasteful; it costs time and money to enter data more than once.
It takes up additional storage space, again with associated costs.
Duplication can lead to loss of data integrity.
Data Dependence: the physical structure and storage of data files and records are defined in the
application code. This means that changes to an existing structure are difficult to make.
Incompatible file formats
Fixed queries / Proliferation of application programs: file-based systems are very dependent
upon the application developer, who has to write any queries or reports that are required.
No provision for security or integrity
Recovery in the event of hardware/software failure was limited or non-existent
Access to the files was restricted to one user at a time; there was no provision for shared
access by staff in the same department.
Introducing DB and DBMS
DB (Database)
shared, integrated computer structure that stores:
End user data (raw facts)
Metadata (data about data), through which the end-user data are integrated and managed.
Metadata provide a description of the data characteristics and the set of relationships that link
the data found within the database. A database resembles a very well-organized electronic filing
cabinet in which powerful software (the DBMS) helps manage the cabinet's contents.
DBMS (database management system):
Collection of programs that manages database structure and controls access to the data
Possible to share data among multiple applications or users
Makes data management more efficient and effective.
Roles:
DBMS serves as the intermediary between the user and the DB. The DBMS receives all application
requests and translates them into complex operations required to fulfil those requests.
The DBMS hides much of the DB's internal complexity from the application programs and users.
The DBMS uses a System Catalog:
a detailed data dictionary that provides access to the system tables (metadata) which describe the
database.
Typically stores:
names, types, and sizes of data items
constraints on the data
names of authorized users
data items accessible by a user and the type of access
usage statistics (ref. optimisation)
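As a hedged illustration, many relational DBMSs expose their catalog through the standard INFORMATION_SCHEMA views (names vary by product; Oracle, for example, uses its own dictionary views). The table name used here is hypothetical.

-- List the columns, data types and sizes recorded in the catalog for one table
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'CUSTOMER';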
Role & Advantages of the DBMS
End users have better access to more and better-managed data
Promotes an integrated view of the organization's operations
Minimized Data Inconsistency: the probability of data inconsistency is greatly reduced
Improved Data Access: possible to produce quick answers to ad hoc queries
Particularly important when compared to previous, historical DBMSs.
Improved Data Sharing: creates an environment in which end users have better access to more and
better-managed data; such access makes it possible for end users to respond quickly to
changes in their environment.
Better Data Integration: wider access to well-managed data promotes an integrated view
of the organisation's operations and a clearer view of the big picture.
Data Inconsistency exists when different versions of the same data appear in different
places.
Increased End-user Productivity: the availability of data, combined with tools that transform
data into usable information, empowers end users to make quick, informed decisions that can
be the difference between success and failure in the global economy.
First, the DBMS enables data in the DB to be shared among multiple applications or users.
Second, it integrates the many different users' views of the data into a single, all-encompassing data
repository. Because data are the crucial raw material from which information is derived, you must have
a good way of managing such data.
Role & Advantages of the DBMS
DBMS serves as intermediary between the user/applications and the DB, compare this to the
previous file based systems.
Database System: refers to an organization of components that define and regulate the collection, storage,
management and use of data within a DB environment.
Types of Databases
Can be classified by users:
Single User: supports only one user at a time. If user A is using the DB, users B and C must
wait until user A is done.
Multiuser: supports multiple users at the same time. When a multiuser DB supports a
relatively small number of users (usually fewer than 50) or a department within an organization,
it is called a workgroup database. When the DB is used by the entire organization and supports
many users across many departments, it is called an enterprise database.
Can be classified by location:
Centralized: Supports data located at a single site
Distributed: Supports data distributed across several sites
Can be classified by use:
Transactional (or production): OLTP
Supports a company's day-to-day operations
Data Warehouse: OLAP
Stores data used to generate information required to make tactical or strategic decisions
Often used to store historical data
Structure is quite different
History of Database Systems
First-generation
Hierarchical and Network
Second generation
Relational
Third generation
Object-Oriented
Object-Relational
DBMS FUNCTIONS
DBMS performs several important functions that guarantee the integrity and consistency of the data
in the DB.
These include:
1. Data Dictionary MGT.: Stores definitions of the data elements and their relationship
(metadata) in a data dictionary. DBMS uses data dictionary to look up the required data
components structure and relationships, thus relieving you from having to code such
complex relationships in each program.
2. Data Storage Mgt: DBMS creates and manages the complex structure required for data
storage, thus relieving you of the difficult task of defining and programming the physical
data characteristic.
3. Data Transformation and Presentation: It transforms entered data to conform to required
data structure. DBMS relieves you of the chore of making a distribution between the logical
data format and physical data format.
4. Security Mgt. DBMS creates a security system that enforces user security and data privacy.
Security rules determine which users can access the DB, which data items each user can
access and which data operations (read, add, delete or modify) the user can perform.
5. Multiuser Access Control: To provide data integrity and data consistency.
6. Backup and Recovery Mgt: To ensure data safety and integrity. It allows DBA to perform
routine and special backup and recovery procedures.
7. Data Integrity Mgt: Promotes and enforces integrity rules, thus minimising data
redundancy and maximising data consistency.
8. DB Access lang. and Application Programming Interfaces: provides data access through
a query lang.
9. DB Communication Interfaces: Current-generation DBMS accept end-user requests via
multiple, different network envts e.g DBMS might provide access to the DB via the internet
through the use of web browsers such as Mozilla Firefox or internet explorer.
Data Models
Hierarchical Model e.g IMS (IBM's Information Management System)
It was developed in the 1960s to manage large amounts of data for complex manufacturing projects, e.g. the
Apollo rocket that landed on the moon in 1969. Its basic logical structure is represented by an upside-
down tree.
The HM contains levels, or segments. A segment is the equivalent of a file system's record type. The root
segment is the parent of the level 1 segments, which in turn are the parents of the level 2 segments,
etc.; other segments below are children of the segment above.
In short, the HM depicts a set of one-to-many (1:*) relationships between a parent and its child
segments. (Each parent can have many children, but each child has only one parent.)
Limitations:
i. Complex to implement
ii. Difficult to manage
iii. Lack structural independence
Hierarchical Structure diag
Network Model (NM) e.g IDMS (Integrated Database Management System)
The NM was created to represent complex data relationships more effectively than the HM, to improve
DB performance and to impose a DB standard. The lack of a DB standard was troublesome to
programmers and application designers because it made designs and applications less portable.
The Conference on Data Systems Languages (CODASYL) created the Database Task Group (DBTG), which
defined crucial components:
Network Schema: the conceptual organization of the entire DB as viewed by the DB administrator. It
includes a definition of the DB name, the record type for each record and the components that
make up those records.
Network Subschema: defines the portion of the DB seen by the application programs that actually
produce the desired information from the data contained within the DB. The existence of
subschema definitions allows all DB programs to simply invoke the subschema required to
access the appropriate DB file(s).
Data Management Language (DML): defines the environment in which data can be managed. To produce the
desired standardisation for each of the three components, the DBTG specified three distinct DML
components:
- A schema data definition language (DDL): enables the DBA to define the schema
components.
- A subschema DDL: allows application programs to define the DB components that will be
used by the application.
- A data manipulation language: to work with the data in the DB.
The NM allows a record to have more than one parent. A relationship is called a SET. Each set is
composed of at least two record types, i.e. an owner record that is equivalent to the hierarchical model's
parent and a member record that is equivalent to the hierarchical model's child. A set represents a
1:* relationship between owner and member.
Network Model diag
The Relational Model e.g Oracle, DB2
The RM is implemented through a sophisticated Relational Database Management System (RDBMS). The RDBMS
performs the same basic functions provided by the hierarchical and network DBMS systems.
An important advantage of the RDBMS is that it hides the relational model's complexity from the user:
the RDBMS manages all of the physical details, while the user sees the relational DB as a collection of
tables in which data are stored, and can manipulate and query the data in a way that seems intuitive and logical.
Each table is a matrix consisting of a series of row/column intersections. Tables, also called
relations, are related to each other through the sharing of an attribute (field) which is common to both entities.
- A relational diagram is a representation of the relational DB's entities, the attributes within those
entities and the relationships between those entities.
- A relational table stores a collection of related entities; therefore, a relational DB table
resembles a file. The crucial difference between a table and a file is that
a table yields complete data and structural independence because it is a purely logical
structure.
- A reason for the relational DB model's rise to dominance is its powerful and flexible query
language.
The RDBMS uses SQL (a 4GL) to translate user queries into instructions for retrieving the requested data.
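As a minimal sketch of this idea (table and column names are illustrative, not taken from the text), two related tables and a query might look like this; the user never states where or how the rows are physically stored.

-- Two tables related through a shared attribute (DEPT_CODE)
CREATE TABLE DEPT (
  DEPT_CODE  CHAR(4)      PRIMARY KEY,
  DEPT_NAME  VARCHAR(30)  NOT NULL
);
CREATE TABLE EMP (
  EMP_NUM    INTEGER      PRIMARY KEY,
  EMP_NAME   VARCHAR(30)  NOT NULL,
  DEPT_CODE  CHAR(4)      REFERENCES DEPT (DEPT_CODE)
);
-- An intuitive, logical query: employees and the name of their department
SELECT E.EMP_NAME, D.DEPT_NAME
FROM   EMP E JOIN DEPT D ON E.DEPT_CODE = D.DEPT_CODE;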
Object-Oriented Model (OOM) (check appendix G)
In the object-oriented data model (OODM), both data and their relationships are contained in a single
structure known as an OBJECT.
Like the relational model's entity, an object is described by its factual content. But quite unlike an
entity, an object includes information about relationships between the facts within the object, as
well as information about its relationships with other objects. Therefore, the facts within the object are given
greater meaning. The OODM is said to be a semantic data model because semantic indicates meaning.
The OO data model is based on the following components:
i. An object is an abstraction of a real-world entity, i.e. an object may be considered equivalent to
an ER model's entity. An object represents only one individual occurrence of an entity.
ii. Attributes describe the properties of an object.
iii. Objects that share similar characteristics are grouped in classes. A class is a collection of
similar objects with a shared structure (attributes) and behaviour (methods). A class resembles
the ER model's entity set. However, a class differs from an entity set in that it contains a set of
procedures known as methods. A class's methods represent real-world actions, such as
finding a selected Person's name or changing a Person's name.
iv. Classes are organised in a class hierarchy. The class hierarchy resembles an upside-down
tree in which each class has only one parent, e.g. the CUSTOMER class and the EMPLOYEE class
share a parent PERSON class.
v. Inheritance is the ability of an object within the class hierarchy to inherit the attributes and
methods of the classes above it; for example, two classes, CUSTOMER and EMPLOYEE, can be
created as subclasses of the class PERSON. In this case, CUSTOMER and EMPLOYEE
will inherit all attributes and methods from PERSON.
Entity Relationship Model
Complex design activities require conceptual simplicity to yield successful results. Although the
relational model was a vast improvement over the hierarchical and network models, it still lacked
the features that would make it an effective database design tool. Because it is easier to examine
structures graphically than to describe them in text, database designers prefer to use a graphical tool
in which entities and their relationships are pictured. Thus, the entity relationship (ER) model, or
ERM, has become a widely accepted standard for data modelling.
One of the more recent versions of Peter Chen's notation is known as the Crow's Foot Model. The
Crow's Foot notation was originally invented by Gordon Everest. In Crow's Foot notation, graphical
symbols are used instead of the simple notation, such as "n" to indicate "many", used by
Chen. The label "Crow's Foot" is derived from the three-pronged symbol used to represent the
"many" side of a relationship. Although there is a general shift towards the use of UML, many
organisations today still use the Crow's Foot notation. This is particularly true of legacy systems,
which run on obsolete hardware and software but are vital to the organisation. It is therefore
important that you are familiar with both Chen's and Crow's Foot modelling notations.
More recently, the class diagram component of the Unified Modelling Language (UML) has been used to
produce entity relationship models. Although class diagrams were developed as part of the
larger UML object-oriented design method, the notation is emerging as the industry data modelling
standard.
The ERM uses ERDs to represent the conceptual database as viewed by the end user. The ERM's
main components are entities, relationships and attributes. The ERD also includes connectivity and
cardinality notations. An ERD can also show relationship strength, relationship participation
(optional or mandatory), and degree of relationship (unary, binary, ternary, etc).
ERDs may be based on many different ERMs.
The Object Oriented Model
Objects that share similar characteristics are grouped in classes. Classes are organized in a class
hierarchy and contain attributes and methods. Inheritance is the ability of an object within the class
hierarchy to inherit the attributes and methods of classes above it.
An OODBMS will use pointers to link objects together. Is this a backwards step?
The OO data model represents an object as a box; all of the object's attributes and relationships to
other objects are included within the object box. The object representation of the INVOICE
includes all related objects within the same object box. Note that the connectivities (1:1 and 1:*)
indicate the relationships of the related objects to the invoice.
The ER model uses 3 separate entities and 2 relationships to represent an invoice transaction. As
customers can buy more than one item at a time, each invoice references one or more lines, one
item per line.
The Relational Model
Developed by Codd (IBM) in 1970
Considered ingenious but impractical in 1970
Conceptually simple
Computers of the time lacked the power to implement the relational model
Today, microcomputers can run sophisticated relational database software.
Advantages of DBMS
Control of data redundancy
Data consistency
More information from the same amount of data
Sharing of data
Improved data integrity
Improved security
Enforcement of standards
Economy of scale
Balance conflicting requirements
Improved data accessibility and responsiveness
Increased productivity
Improved maintenance through data independence
Increased concurrency
Improved backup and recovery services
Disadvantages of DBMS
Cost of DBMS
Additional hardware costs
Cost of conversion
Complexity
Size
Maintenance / Performance?
Higher dependency / impact of a failure
Degrees of Data Abstraction
ANSI (American National Standards Institute)
SPARC (Standards Planning and Requirements Committee) defined a framework for data modelling
based on degrees of data abstraction
ANSI-SPARC Three Level Architecture
External Model
A specific representation of an external view is known as an External Schema. The use of external views
representing subsets of the DB has some important advantages:
+ It makes it easy to identify the specific data required to support each business unit's operations.
+ It makes the designer's job easier by providing feedback about the model's adequacy.
+ It helps to ensure security constraints in the DB design.
+ It makes application program development much simpler.
Conceptual Model
It represents a global view of the entire DB. It is a representation of data as viewed by the entire
organisation. It integrates all external views (entities, relationships, constraints and processes) into a
single global view of the entire data in the enterprise, known as the Conceptual Schema. The most widely
used conceptual model is the ER model. The ER model is used to graphically represent the
conceptual schema.
Advantages:
It provides a relatively easily understood macro-level view of the data environment.
It is independent of both software and hardware.
Software independence means the model does not depend on the DBMS software used to
implement the model.
Hardware independence means the model does not depend on the hardware used in the
implementation of the model.
Internal Model
Once a specific DBMS has been selected, the internal model maps the conceptual model to the
DBMS. The internal model is the representation of the DB as seen by the DBMS. An Internal
Schema depicts a specific representation of an internal model, using the DB constructs supported by
the chosen DBMS. The internal model depends on the specific DB software.
When you can change the internal model without affecting the conceptual model, you have Logical
Independence. However, the internal model is also hardware-independent because it is unaffected by the choice
of the computer on which the software is installed.
Physical Model
It operates at the lowest level of abstraction, describing the way data are saved on storage media
such as disks or tapes. P.M. requires the definition of both the physical storage devices and the
(physical) access methods required to reach the data within those storage devices, making it both
software and hardware dependent.
- When you can change the physical model without affecting the internal model, you have Physical
Independence. Therefore, a change in storage devices/methods and even a change in OS will not
affect the internal model.
External Level
The users' view of the database.
Describes the part of the database that is relevant to a particular user.
Conceptual Level
Community view of the database.
Describes what data is stored in database and relationships among the data.
Internal Level
Physical representation of the database on the computer.
Describes how the data is stored in the database.
The Importance of Data Models
Data models
Relatively simple representations, usually graphical, of complex real-world data structures
Facilitate interaction among the designer, the applications programmer, and the end user.
A data model is a (relatively) simple abstraction of a complex real-world data environment. DB designers use
data models to communicate with applications programmers and end users. The basic data-
modelling components are entities, attributes, relationships and constraints.
Business rules are used to identify and define the basic modeling component within a specific real-
world environment.
Database Modelling
Alternate Notations
Crow's Foot Notation
Chen Notation
UML Unified Modelling Language
Review the coverage provided on the book's CD in Appendix E regarding these alternative notations.
Data Model Basic Building Blocks
Entity Relationship Diagrams (ERD)
The ERD represents the conceptual DB as viewed by the end user and depicts the DB's main
components:
Entity - anything about which data is to be collected and stored
Attribute - a characteristic of an entity
Relationship - describes an association among entities
One-to-many (1:m) relationship
Many-to-many (m:n) relationship
One-to-one (1:1) relationship
Constraint - a restriction placed on the data
Entity Relationship Diagrams
An Entity
A thing of independent existence on which you may wish to hold data
Example: an Employee, a Department
An entity is an object of interest to the end user. The word entity in the ERM corresponds to a table, not
to a row, in the relational environment. The ERM refers to a specific table row as an entity instance or entity
occurrence. In UML notation, an entity is represented by a box that is subdivided into three parts:
- the top part is used to name the entity, a noun usually written in capital letters.
- the middle part is used to name and describe the attributes.
- the bottom part is used to list the methods. Methods are used only when designing object-
relational/object-oriented DB models.
The two terms ER model and ER diagram are often used interchangeably to refer to the same thing: a
graphical representation of a database. To be more precise, you would refer to the specific notation
being used as the model (e.g. Chen, Crow's Foot), describing what type of symbols are used, whereas
an actual example of that notation being used in practice would be called an ER diagram. So the
model is the notation specification and the diagram is an actual drawing.
Relationships: an association between entities.
Entity types may bear relationship to one another
Example: Employee works in a Department
Recording: which Dept an Emp is in
The relationship could be
Works in
Existence Dependence: an entity is said to be existence-dependent if it can exist in the database only
when it is associated with another related entity occurrence. In implementation terms, an entity is
existence-dependent if it has a mandatory foreign key, that is, a foreign key attribute that cannot be
null.
Relationship Strength: this concept is based on how the primary key of a related entity is defined.
Weak (Non-identifying) Relationship: exists if the primary key of the related entity does not contain a
primary key component of the parent entity, e.g. COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT)
CLASS (CLASS_CODE, CRS_CODE, CLASS_SECTION)
Strong (Identifying) Relationship: exists when the primary key of the related entity contains a primary key
component of the parent entity, e.g. COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT)
CLASS (CRS_CODE, CLASS_SECTION, CLASS_TIME, ROOM_CODE)
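A hedged SQL sketch of the two cases, using the COURSE and CLASS attributes above (data types and the variant table names CLASS_WEAK/CLASS_STRONG are assumptions for illustration):

CREATE TABLE COURSE (
  CRS_CODE    CHAR(8)  PRIMARY KEY,
  DEPT_CODE   CHAR(4),
  CRS_CREDIT  INTEGER
);
-- Weak (non-identifying): CLASS has its own primary key; CRS_CODE is only a foreign key
CREATE TABLE CLASS_WEAK (
  CLASS_CODE     CHAR(8)  PRIMARY KEY,
  CRS_CODE       CHAR(8)  NOT NULL REFERENCES COURSE (CRS_CODE),
  CLASS_SECTION  CHAR(4)
);
-- Strong (identifying): the parent's key CRS_CODE is part of CLASS's composite primary key
CREATE TABLE CLASS_STRONG (
  CRS_CODE       CHAR(8)  REFERENCES COURSE (CRS_CODE),
  CLASS_SECTION  CHAR(4),
  CLASS_TIME     VARCHAR(20),
  ROOM_CODE      CHAR(6),
  PRIMARY KEY (CRS_CODE, CLASS_SECTION)
);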
Weak Entities
A weak entity is one that meets two conditions:
1. It is existence-dependent; it cannot exist without the entity with which it has a relationship
2. It has a primary key that is partially or totally derived from the parent entity in the relationship.
Attributes are characteristics of entities; for example, the STUDENT entity includes the attributes
STU_LNAME, STU_FNAME and STU_INITIAL.
Domains: attributes have a domain; a domain is the attribute's set of possible values.
Relationship Degree
A relationship degree indicates the number of entities or participants associated with a relationship.
Unary Relationships: exists when an association is maintained within a single entity.
Binary Relationships: exist when two entities are associated in a relationship.
Ternary Relationship: exists when three entities are associated.
Recursive Relationship: one in which a relationship can exist between occurrences of the same
entity set. (Naturally, such a condition is found within a unary relationship.)
Composite Entity (Bridge Entity):
This is composed of the primary keys of each of the entities to be connected. An example is converting
a *:* relationship into two 1:* relationships.
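A minimal sketch of a bridge entity (table and column names such as ENROLL are hypothetical, chosen for illustration):

CREATE TABLE STUDENT (
  STU_NUM    INTEGER      PRIMARY KEY,
  STU_LNAME  VARCHAR(30)
);
CREATE TABLE CLASS (
  CLASS_CODE  CHAR(8)  PRIMARY KEY
);
-- Bridge entity: resolves the *:* relationship between STUDENT and CLASS into two 1:* relationships
CREATE TABLE ENROLL (
  STU_NUM       INTEGER  REFERENCES STUDENT (STU_NUM),
  CLASS_CODE    CHAR(8)  REFERENCES CLASS (CLASS_CODE),
  ENROLL_GRADE  CHAR(1),
  PRIMARY KEY (STU_NUM, CLASS_CODE)   -- composed of the primary keys of the connected entities
);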
Composite and Simple Attributes.
A composite attribute is an attribute that can be further subdivided to yield additional attributes, e.g.
ADDRESS can be subdivided into street, city, state and postcode.
A simple attribute is an attribute that cannot be subdivided, e.g. age, sex and marital status.
Single-Valued Attributes: attributes that can have only a single value, e.g. a person can have only one
social security number.
Multivalued Attributes: may have many values, e.g. a person may have several university degrees.
Derived Attribute: an attribute whose value is calculated from other attributes, e.g. an employee's age,
EMP_AGE, may be found by computing the integer value of the difference between the current date
and EMP_DOB.
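As a hedged sketch of a derived attribute (the exact date functions vary by DBMS; the form below is PostgreSQL-style, and the EMPLOYEE table and its columns are assumed):

-- EMP_AGE is not stored; it is derived from EMP_DOB whenever it is needed
SELECT EMP_NUM,
       EXTRACT(YEAR FROM AGE(CURRENT_DATE, EMP_DOB)) AS EMP_AGE
FROM   EMPLOYEE;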
Properties of an entity we want to record
Example: Employee number, name
The attributes could be
EMP_NO, EMP_NAME
Relation Types
Relation between two entities Emp and Dept
More than one relation between entities
Lecturer and Student
Teaches - Personal Tutor
Relationship with itself
Called Recursive
Part made up of parts
Degree and cardinality are two important properties of the relational model.
The word relation, also known as a dataset in Microsoft Access, is based on the mathematical set
theory from which Codd derived his model. Since the relational model uses attribute values to
establish relationships among tables, many database users incorrectly assume that the term relation
refers to such relationships. Many then incorrectly conclude that only the relational model permits
the use of relationships.
A Relation Schema is a textual representation of the DB tables, where each table is described by its
name followed by the list of its attributes in parentheses, e.g. LECTURER (EMP_NUM,
LECTURER_OFFICE, LECTURER_EXTENSION).
Rows are sometimes referred to as records.
Columns are sometimes labelled as fields.
Tables are occasionally labelled as files.
A DB table is a logical rather than a physical concept, whereas file, record and field describe
physical concepts.
Properties of a Relation
1. A table is perceived as 2-dimensional structure composed of rows and columns.
2. Each table row (tuple) represents a single entity occurrence within the entity set and must be
distinct. Duplicate rows are not allowed in a relation.
3. Each table column represents an attribute and each column has a distinct name
4. Each row/column intersection (cell) in a relation should contain only a single data value.
5. All values in a column must conform to the same data format.
6. Each column has a specific range of values known as the Attribute Domain.
7. The order of the rows and columns is immaterial to the DBMS.
8. Each table must have an attribute or a combination of attributes that uniquely identifies each
row.
Cardinality of Relationship
Determines the number of occurrences from one entity to another.
Example: for each Dept there are a number of Employees that work in it.
Cardinality is used to express the maximum number of entity occurrences associated with one
occurrence of the related entity.
Participation determines whether all occurrences of an entity participate in the relationship or not.
Three Types of Cardinality
One-to-Many: Dept - Emp
Many-to-Many: Student - Courses (must be resolved into 1:m relationships)
One-to-One: Husband - Wife (UK law)
Optionality / Participation
Identifies the minimum cardinality of a relationship between entities
0 - May be related
1 - Must be related
Developing an ER Diagram
The process of database design is an iterative rather than a linear or sequential process. An iterative
process is thus one based on repetition of processes and procedures.
1. Create a detailed narrative of the organization's description of operations.
2. Identify the business rules based on the descriptions of operations.
3. Identify Entities
4. Work out Relationships
5. Develop an initial ERD
6. Work out Cardinality/ Optionality
7. Identify the primary and foreign keys
8. Identify Attributes
9. Revise and Review the ERD
Types of Keys
Primary Key - The attribute which uniquely identifies each entity occurrence
Candidate Key - one of a number of possible attributes which could be used as the key field
Composite Key - when more than one attribute is required to identify each occurrence
Composite Primary Key - a primary key composed of more than one attribute
Foreign Key - when an entity has a key attribute from another entity stored in it
Superkey - an attribute (or combination of attributes) that uniquely identifies each row in a table
Candidate Key - a superkey that does not contain a subset of attributes that is itself a superkey
Primary Key - a candidate key selected to uniquely identify all other attribute values in any given
row. It cannot contain null entries.
Identifiers (Primary Keys) - the ERM uses identifiers to uniquely identify each entity instance. Identifiers
are underlined in the ERD; key attributes are also underlined when writing the
relational schema.
Secondary Key - an attribute (or combination of attributes) used strictly for data retrieval purposes
Foreign Key - an attribute (or combination of attributes) in one table whose values must either
match the primary key in another table or be null
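A brief hedged sketch tying these key types to SQL (table names, column names and types are assumptions):

CREATE TABLE AGENT (
  AGENT_NUM  INTEGER  PRIMARY KEY
);
CREATE TABLE CUSTOMER (
  CUS_NUM    INTEGER      PRIMARY KEY,                     -- primary key (the chosen candidate key)
  CUS_EMAIL  VARCHAR(60)  UNIQUE,                          -- another candidate key, enforced with UNIQUE
  CUS_LNAME  VARCHAR(30),
  AGENT_NUM  INTEGER      REFERENCES AGENT (AGENT_NUM)     -- foreign key into AGENT
);
-- Secondary key: used strictly to speed up retrieval, not to identify rows uniquely
CREATE INDEX IDX_CUS_LNAME ON CUSTOMER (CUS_LNAME);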
The basic UML ERD
The basic Crows foot ERD
Example Problem 1
A college library holds books for its members to borrow. Each book may be written by more than
one author. Any one author may have written several books. If no copies of a wanted book are
currently in stock, a member may make a reservation for the title until it is available. If books are
not returned on time a fine is imposed and if the fine is not paid the member is barred from loaning
any other books until the fine is paid.
ER Diag One
Example Problem 2
A local authority wishes to keep a database of all its schools and the school children that are
attending each school. The system should also be able to record teachers available to be employed
at a school and be able to show which teachers teach which children. Each school has one head
teacher whose responsibility it is to manage their individual school; this should also be modelled.
Example Problem 3
A university runs many courses. Each course consists of many modules, each module can
contribute to many courses. Students can attend a number of modules but first they must possess the
right qualifications to be accepted on a particular course. Each course requires a set of qualifications
at particular grades to allow students to be accepted; for example, the Science course requires at least
2 A levels, one of which must be mathematics at grade B or above. There is the normal teaching
student/lecturer relationship, but you will also have to record personal tutor assignments.
Review Questions ch1
Discuss each of the following: Data, Field, Record, File
What is data redundancy and which characteristic of the file system can lead to it?
Discuss the lack of data independence in file systems.
What is a DBMS and what are its functions?
What is structural independence and why is it important?
Explain the difference between data and information.
What is the role of the DBMS and what are its advantages?
List and describe the different types of databases.
What are the main components of a database system?
What is metadata?
Explain why database design is important.
What are the potential costs of implementing a database system?
Review Questions ch2
1. Discuss the importance of data modelling.
2. What is a business rule, and what is its purpose in data modelling?
3. How would you translate business rules into data model components?
5. What three languages were adopted by the DBTG to standardize the basic network data model,
and why was such standardisation important to users and designers?
6. Describe the basic features of the relational data model and discuss their importance to the end
user and the designer.
7. Explain how the entity relationship (ER) model helped produce a more structured relational
database design envt.
9. Why is an object said to have greater semantic content than an entity?
10. What is the difference between an object and a class in the object-oriented data model
(OODM)?
12. What is an ERDM, and what role does it play in the modern (production) database envt?
14. What is a relationship, and what three types of relationships exist?
15. Give an example of each of the three types of relationships.
16. What is a table and what role does it play in the relational model?
17. What is a relational diagram? Give an example.
18. What is logical independence?
19. What is physical independence?
20. What is connectivity? Draw ERDs to illustrate connectivity.
Review Questions ch.3
1. What is the difference between a database and a table?
2. What does a database expert mean when he/she says that a database displays both entity
integrity and referential integrity?
3. Why are entity integrity and referential integrity important in a database?
Review Questions ch5
1. What two conditions must be met before an entity can be classified as a weak entity? Give an
example of a weak entity.
2. What is a strong (or identifying) relationship?
4. What is a composite entity and when is it used?
6. What is a recursive relationship? Give an example.
7. How would you graphically identify each of the following ERM components in a UML model:
I. An Entity
II. The multiplicity (0:*)
8. Discuss the difference between a composite key and a composite attribute. How would each be
indicated in an ERD?
9. What two courses of action are available to a designer when he or she encounters a multivalued
attribute?
10. What is a derived attribute? Give example.
11. How is a composite entity represented in an ERD and what is its function? Illustrate using the
UML notation.
14. What three (often conflicting) database requirements must be addressed in database design?
15. Briefly, but precisely, explain the difference between single-valued attributes and simple attributes. Give
an example of each.
16. What are multivalued attributes and how can they be handled within the database design?
Enhanced Entity Relationship (EER) Modelling ( Extended Entity Relationship Model)
This is the result of adding more semantic constructs to the original entity relationship (ER) model.
Examples of the additional concepts in EER models are:
Specialisation/Generalization
Super Class/Sub Class
Aggregation
Composition
In modelling terms, an entity supertype is a generic entity type that is related to one or more entity
subtypes, where the entity supertype contains the common characteristics and the entity subtypes
contain the unique characteristics of each entity subtype.
Specialization Hierarchy
Entity supertypes and subtypes are organised in a specialization hierarchy. The specialization
hierarchy depicts the arrangement of higher-level entity supertypes (parent entities) and lower-level
entity subtypes (child entities).
In UML notation subtypes are called Subclasses and supertypes are known as Superclasses
Specialization and Generalization
Specialization is the top-down process of identifying lower-level, more specific entity subtypes
from a higher-level entity supertype. Specialization is based on grouping unique characteristics and
relationships of the subtypes.
Generalization is the bottom-up process of identifying a higher-level, more generic entity supertype
from lower-level entity subtypes. Generalization is based on grouping common characteristics and
relationships of the subtypes.
Superclass - an entity type that includes one or more distinct sub-groupings of its occurrences;
therefore a generalization.
Subclass - a distinct sub-grouping of occurrences of an entity type; therefore a specialization.
Attribute Inheritance: an entity in a subclass represents the same real-world object as in the superclass,
and may possess subclass-specific attributes as well as those associated with the superclass.
Composition and Aggregation
Aggregation is whereby a larger entity can be composed of smaller entities, e.g. a
University is composed of Departments.
A special case of aggregation is known as Composition. This is a much stronger relationship than
aggregation, since when the parent entity instance is deleted, all child entity instances are
automatically deleted.
An Aggregation construct is used when an entity is composed of a collection of other entities, but
the entities are independent of each other.
A Composition construct is used when two entities are associated in an aggregation association
with a strong identifying relationship. That is, deleting the parent deletes the children instances.
Normalization of Database Tables
This is a process for evaluating and correcting table structures to minimise data redundancies,
thereby reducing the likelihood of data anomalies.
Normalization works through a series of stages called normal forms, i.e. first normal form (1NF),
second normal form (2NF), third normal form (3NF). From a structural point of view, 2NF is better
than 1NF and 3NF is better than 2NF. For most business database design purposes, 3NF is as high as
we need to go in the normalization process. The highest level of normalization is not always the most
desirable, and almost all business designs use 3NF as the ideal normal form.
A table is in 1NF when all key attributes are defined and when all remaining attributes are
dependent on the primary key. However, a table in 1NF can still contain both partial and transitive
dependencies. (A partial dependency is one in which an attribute is functionally dependent on only
a part of a multi-attribute primary key. A transitive dependency is one in which one non-key attribute is
functionally dependent on another non-key attribute.) A table with a single-attribute primary key cannot
exhibit partial dependencies.
A table is in 2NF when it is in 1NF and contains no partial dependencies. Therefore, a 1NF table is
automatically in 2NF when its primary key is based on only a single attribute. A table in 2NF may still
contain transitive dependencies.
A table is in 3NF when it is in 2NF and contains no transitive dependencies. When a table has only
a single candidate key, a 3NF table is automatically in BCNF (Boyce-Codd Normal Form).
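As a small hedged illustration of reaching 3NF (table and column names are assumed, not from the text): suppose STUDENT(STU_NUM, STU_NAME, DEPT_CODE, DEPT_NAME), where STU_NUM determines DEPT_CODE and DEPT_CODE determines DEPT_NAME. DEPT_NAME is transitively dependent on the primary key, so it is moved to its own table.

-- 3NF decomposition: the transitive dependency DEPT_CODE -> DEPT_NAME is removed
CREATE TABLE DEPARTMENT (
  DEPT_CODE  CHAR(4)      PRIMARY KEY,
  DEPT_NAME  VARCHAR(30)
);
CREATE TABLE STUDENT (
  STU_NUM    INTEGER      PRIMARY KEY,
  STU_NAME   VARCHAR(30),
  DEPT_CODE  CHAR(4)      REFERENCES DEPARTMENT (DEPT_CODE)
);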
Normalization Process
Checking ER model using functional dependency
Result - Removes any data duplication problems
Saves excess storage space
Removes insertion, update and deletion anomalies.
Functional Dependency: A -> B
B is functionally dependent on A
If we know A then we can find B
e.g. Studno -> Studname
Review Questions
1. What is an entity supertype and why is it used?
2. What kinds of data would you store in an entity subtype?
3. What is a specialization hierarchy?
Review Questions
1. What is normalization?
2. When is a table in 1NF?
3. When is a table in 2NF?
4. When is a table in 3NF?
5. When is a table in BCNF?
7. What is a partial dependency? With what normal form is it associated?
8. What three data anomalies are likely to be the result of data redundancy? How can such
anomalies be eliminated?
9. Define and discuss the concept of transitive dependency.
11. Why is a table whose primary key consists of a single attribute automatically in 2NF when it is in
1NF?
Relational Algebra and SQL
Relational DB Roots
Relational algebra and relational calculus are the mathematical basis for relational databases.
Proposed by E.F. Codd in 1971 as the basis for defining the relational model.
Relational algebra
Procedural, describes operations
Relational calculus
Non-procedural / Declarative
Relational algebra is a collection of formal operations acting on relations which produce
new relations as a result. The algebra is based on predicate logic and set theory and is described as
a procedural language. Relational algebra defines a theoretical way of manipulating table contents through
a number of relational operators.
Set Theory
Relational Algebra Operators
UNION
INTERSECT
DIFFERENCE
SELECT (Restrict)
PROJECT
CARTESIAN PRODUCT
DIVISION
JOIN
The SELECT operator is denoted by σ (sigma) and is formally defined as
σ<predicate>(R) or σ<criterion>(RELATION)
where σ<predicate>(R) is the set of specified tuples of the relation R and <predicate> is the criterion used to
extract the required tuples.
The PROJECT operator is denoted by π (pi); it returns all values for selected attributes and is formally defined as
π<a1, ..., an>(R) or π<list of attributes>(RELATION)
Relational Operators
Union R ∪ S
builds a relation consisting of all tuples appearing in either or both of two specified
relations.
Intersection R ∩ S
builds a relation consisting of all tuples appearing in both of two specified relations.
Difference (complement) R - S
builds a relation consisting of all tuples appearing in the first and not the second of two
specified relations.
Select (Restrict) σ<a>(R)
extracts specified tuples from a specified relation.
Project π<a,b>(R)
extracts specified attributes from a specified relation.
Cartesian product R x S
builds a relation from two specified relations consisting of all possible concatenated pairs of tuples, one from
each of the two specified relations.
Cartesian Product Example
The Cartesian product of two relations R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with
cardinality j is a relation R3 with degree k = n + m, cardinality i * j and attributes
(a1, a2, ..., an, b1, b2, ..., bm). This can be denoted R3 = R1 x R2.
Division R / S
Takes two relations, one binary and one unary, and builds a relation consisting of all
values of one attribute of the binary relation that match (in the other attribute) all values in the
unary relation.
Join R ⋈ S
Builds a relation from two specified relations consisting of all possible concatenated pairs of
tuples, one from each of the two specified relations, such that in each pair the two tuples satisfy
some specified condition.
The DIVISION of two relations R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with cardinality j
is a relation R3 with degree k = n - m.
The JOIN of two relations R1(a1, a2, ..., an) and R2(b1, b2, ..., bm) is a relation R3 with degree k = n + m and
attributes (a1, a2, ..., an, b1, b2, ..., bm) that satisfy a specified join condition.
Division
See page 129
Equijoin Example
i. Compute R1 x R2: this first performs a Cartesian product to form all possible combinations
of the rows of R1 and R2.
ii. Restrict the Cartesian product to only those rows where the values in certain columns match.
See page 131
Secondary Algebraic Operators
Intersection: R ∩ S = R - (R - S)
Division: R / S = π<A>(R) - π<A>((π<A>(R) x S) - R), where A is the set of attributes of R that are not in S
Join (theta join): R ⋈<θ> S = σ<θ>(R x S)
Equijoin: R ⋈<A=B> S = σ<A=B>(R x S)
Natural Join: R ⋈ S = π<A>(σ<C>(R x S)), where C equates the common attributes and π<A> removes the duplicated columns
Semijoin: R ⋉<θ> S = π<A>(R ⋈<θ> S), where A is the set of attributes of R
Example Tables
S1:
S#  SNAME  CITY
S1  Smith  London
S4  Clark  London

S2:
S#  SNAME  CITY
S1  Smith  London
S2  Jones  Paris

P:
P#  PNAME  WEIGHT
P1  Bolt   10
P2  Nut    15
P3  Screw  15

SP:
S#  P#  QTY
S1  P1  10
S1  P2  5
S4  P2  7
S2  P3  8
Union S1 ∪ S2
Produce a table consisting of the rows in either S1 or S2:
S#  SNAME  CITY
S1  Smith  London
S4  Clark  London
S2  Jones  Paris
Intersection S1 ∩ S2
Produce a table consisting of the rows in both S1 and S2:
S#  SNAME  CITY
S1  Smith  London
Difference S1 - S2
Produce a table consisting of the rows in S1 and not in S2:
S#  SNAME  CITY
S4  Clark  London
Restriction σ<CITY='London'>(S2)
Extract rows from a table that meet a specific criterion:
S#  SNAME  CITY
S1  Smith  London
Project π<PNAME>(P)
Extract the values of specified columns from a table:
PNAME
Bolt
Nut
Screw
Cartesian product S1 x P
Produce a table of all combinations of rows from two other tables:
S#  SNAME  CITY    P#  PNAME  WEIGHT
S1  Smith  London  P1  Bolt   10
S1  Smith  London  P2  Nut    15
S1  Smith  London  P3  Screw  15
S4  Clark  London  P1  Bolt   10
S4  Clark  London  P2  Nut    15
S4  Clark  London  P3  Screw  15
Divide P / S
Produce a new table by selecting the values of the column of P that appear with every row of S
(which parts does every supplier supply?):

P:
PARTNAME  S#
Bolt      1
Nut       1
Screw     1
Washer    1
Bolt      2
Screw     2
Washer    2
Bolt      3
Nut       3
Washer    3

S:
S#
1
2
3

Result:
PARTNAME
Bolt
Washer
Natural Join: you must select only the rows in which the common attribute values match.
You could also do a right outer join or a left outer join to select the rows that have no matching
values in the other related table.
An Inner Join is one in which only the rows that meet a given criterion are selected.
An Outer Join returns the matching rows as well as the rows with unmatched attribute values for one of the
tables being joined.
Natural Join S1 ⋈ SP
Produce a table from two tables on matching columns:
S#  SNAME  CITY    P#  QTY
S1  Smith  London  P1  10
S1  Smith  London  P2  5
S4  Clark  London  P2  7
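For reference, the algebra examples above map onto SQL roughly as follows (a hedged sketch; INTERSECT, EXCEPT and NATURAL JOIN are standard SQL but are not supported by every DBMS, and the tables are assumed to exist as defined earlier):

SELECT * FROM S1 UNION     SELECT * FROM S2;   -- Union
SELECT * FROM S1 INTERSECT SELECT * FROM S2;   -- Intersection
SELECT * FROM S1 EXCEPT    SELECT * FROM S2;   -- Difference
SELECT * FROM S2 WHERE CITY = 'London';        -- Restriction
SELECT PNAME FROM P;                           -- Projection
SELECT * FROM S1 CROSS JOIN P;                 -- Cartesian product
SELECT * FROM S1 NATURAL JOIN SP;              -- Natural join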
Read pg 132 - 136
Consider
Get supplier numbers and cities for suppliers who supply
part P2.
Algebra
Join relation S on S# with SP on S#
Restrict the results of that join to tuples with P# = P2
Project the result of that restriction on S# and City
Calculus
Get S# and city for suppliers such that there exists a shipment SP with the same S# value and with
P# value P2.
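The same request in SQL (a hedged sketch; the supplier and shipment tables are assumed to be named S and SP, with the S# and P# columns spelled SNUM and PNUM so that they are legal SQL identifiers):

-- Supplier numbers and cities for suppliers who supply part P2
SELECT DISTINCT S.SNUM, S.CITY
FROM   S JOIN SP ON S.SNUM = SP.SNUM
WHERE  SP.PNUM = 'P2';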
SQL (Structured Query Language) is a non-procedural language.
Data Manipulation Language (DML)
Data Definition Language (DDL)
Data Control Language (DCL)
Embedded and Dynamic SQL
Security
Transaction Management
C/S execution and remote DB access
Types of Operations
Data Definition Language DDL
Define the underlying DB structure
SQL includes commands to create DB objects such as tables, indexes and views, e.g.
CREATE TABLE, NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CREATE INDEX, CREATE
VIEW, ALTER TABLE, DROP TABLE, DROP INDEX, DROP VIEW
Data Definition Language
Create / Amend / Drop a table
Specify integrity checks
Build indexes
Create virtual Views of a table
Data Manipulation Language
Retrieving and updating the data
Includes commands to insert, update, delete and retrieve data within the DB
tables, e.g. INSERT, SELECT, WHERE, GROUP BY, HAVING, ORDER BY,
UPDATE, DELETE, COMMIT, ROLLBACK
Data Manipulation Language
Query the DB to show selected data
Insert, delete and update table rows
Control transactions - commit / rollback
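A short hedged DML sketch (the PRODUCT table and its columns are illustrative):

INSERT INTO PRODUCT (P_CODE, P_DESCRIPT, P_PRICE) VALUES ('P101', 'Hammer', 9.99);
UPDATE PRODUCT SET P_PRICE = 10.49 WHERE P_CODE = 'P101';
SELECT P_CODE, P_DESCRIPT, P_PRICE FROM PRODUCT WHERE P_PRICE > 5 ORDER BY P_PRICE;
DELETE FROM PRODUCT WHERE P_CODE = 'P101';
COMMIT;   -- make the changes permanent (or ROLLBACK to undo them)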
Data Control Language
- Control Access rights to parts of the DB
GRANT to allow specified users to perform specified tasks.
DENY to disallow specified users from performing specified tasks.
REVOKE to cancel previously granted or denied permissions.
Examples of privileges that can be granted or revoked include UPDATE (allows a user to update records),
SELECT/read (allows a user only to view the data, not edit it) and
DELETE (allows a user to delete records in a database).
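A hedged DCL sketch (the user names are illustrative, and DENY is not available in every DBMS):

GRANT SELECT, UPDATE ON PRODUCT TO clerk1;   -- allow clerk1 to read and update PRODUCT
REVOKE UPDATE ON PRODUCT FROM clerk1;        -- cancel the update permission
GRANT DELETE ON PRODUCT TO manager1;         -- allow manager1 to delete rows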
Reading the Syntax
UPPER CASE = reserved words
lower case = user defined words
Vertical bar | = a choice i.e. asc|desc
Curly braces { } = choice from list
Square brackets [ ] = optional element
Dots = optional repeating items
Syntax of SQL
SELECT [ALL | DISTINCT] {[table.]* | expression [alias], ...}
FROM table [alias] [, table [alias]]
[WHERE condition]
[GROUP BY expression [, expression]
[HAVING condition]]
[ORDER BY {expression | position} [ASC | DESC]]
[{UNION | INTERSECT | MINUS} query]
Purpose of the Commands
SELECT Specifies which columns to appear
FROM Specifies which table/s to be used
WHERE Applies restriction to the retrieval
GROUP BY Groups rows with the same column value
HAVING Adds restriction to the groups retrieved
ORDER BY Specifies the order of the output
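A hedged example that exercises each clause (the PRODUCT table and its columns are assumptions):

-- Average price per vendor, for vendors with more than five products, dearest first
SELECT   V_CODE, AVG(P_PRICE) AS AVG_PRICE
FROM     PRODUCT
WHERE    P_PRICE > 0
GROUP BY V_CODE
HAVING   COUNT(*) > 5
ORDER BY AVG_PRICE DESC;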
A DB Schema is a group of DB objects, such as tables and indexes, that are related to each other
[CREATE SCHEMA AUTHORIZATION {creator};]
Creating Table Structures:
CREATE TABLE tablename (
column1 datatype [constraint],
column2 datatype [constraint],
PRIMARY KEY (column1),
FOREIGN KEY (column1) REFERENCES tablename,
CONSTRAINT constraint);
A foreign key constraint definition ensures that:
You cannot delete a row in the referenced (parent) table if at least one row in the referencing table points to it.
On the other hand, if a change is made to an existing referenced key value, that change must be reflected
automatically in the rows that reference it.
Not Null constraint: used to ensure that a column does not accept nulls.
Unique constraint: used to ensure that all values in a column are unique.
Default constraint: used to assign a value to an attribute when a new row is added to a table.
Check constraint: when the check condition is met for the specified attribute (that is, the condition is true), the data are
accepted for that attribute.
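A hedged sketch bringing the constraints together (table names, column names and types are assumptions):

CREATE TABLE VENDOR (
  V_CODE  INTEGER  PRIMARY KEY
);
CREATE TABLE PRODUCT (
  P_CODE      CHAR(10)      PRIMARY KEY,
  P_DESCRIPT  VARCHAR(35)   NOT NULL,                     -- NOT NULL: nulls rejected
  P_BARCODE   CHAR(13)      UNIQUE,                       -- UNIQUE: no duplicate values allowed
  P_QOH       INTEGER       DEFAULT 0,                    -- DEFAULT: value used when none is supplied
  P_PRICE     NUMERIC(8,2)  CHECK (P_PRICE >= 0),         -- CHECK: condition must be true
  V_CODE      INTEGER       REFERENCES VENDOR (V_CODE)    -- FOREIGN KEY into VENDOR
);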
The COMMIT and ROLLBACK commands are used to ensure DB update integrity in
transaction management.
The EXISTS special operator: EXISTS can be used wherever there is a requirement to execute a
command based on the result of another query.
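A hedged EXISTS sketch, reusing the hypothetical VENDOR and PRODUCT tables from the sketch above: list the vendors that have at least one product.

SELECT V_CODE
FROM   VENDOR V
WHERE  EXISTS (SELECT * FROM PRODUCT P WHERE P.V_CODE = V.V_CODE);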
A VIEW is a virtual table based on a SELECT query. The query can contain columns, computed
columns, aliases and aggregate functions from one or more tables:
CREATE VIEW viewname AS SELECT query;
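A hedged view sketch over the hypothetical PRODUCT table used above:

CREATE VIEW EXPENSIVE_PRODUCT AS
  SELECT P_CODE, P_DESCRIPT, P_PRICE
  FROM   PRODUCT
  WHERE  P_PRICE > 100;
-- The view can then be queried like an ordinary table
SELECT * FROM EXPENSIVE_PRODUCT;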
Embedded SQL refers to the use of SQL statements within an application programming language, e.g.
COBOL, C++, ASP, Java and .NET. The language in which the SQL statements are embedded is
called the host language. Embedded SQL is still the most common approach to maintaining procedural
capabilities in DBMS-based applications.
Get remaining note from the slide pg 12 - 18
Review Questions
1. What are the main operations of relational algebra?
2. What is the Cartesian product? Illustrate your answer with an example.
3. What is the difference between PROJECTION and SELECTION?
4. Explain the difference between a natural join and an outer join.
DBMS Optimization
Database Performance- Tuning Concepts
- The goal of database performance tuning is to execute queries as fast as possible.
- Database performance tuning refers to a set of activities and procedures designed to
reduce the response time of the DB system, i.e. to try to ensure that an end-user query is
processed by the DBMS in the minimum amount of time.
- The performance of a typical DBMS is constrained by 3 main factors:
i. CPU Processing Power
ii. Available primary Memory (RAM)
iii. Input/Output (Hard disk and network) throughput.
System resource       Client                                    Server
Hardware
  CPU                  The fastest possible                      Multiple processors, the fastest possible
                                                                 (e.g. quad-core Intel 2.66 GHz)
  RAM                  Maximum possible                          Maximum possible (e.g. 64 GB)
  Hard disk            Fast IDE hard disk with sufficient        Multiple high-speed, high-capacity disks
                       free hard disk space                      (e.g. 750 GB)
  Network              High-speed connection                     High-speed connection
Software
  Operating system     Fine-tuned for best client                Fine-tuned for best server
                       application performance                   application performance
  Network              Fine-tuned for best throughput            Fine-tuned for best throughput
  Application          Optimize SQL in the client application    Optimize the DBMS for best performance
The system performs best when its hardware and software resources are optimized. Fine-tuning the
performance of a system requires a holistic approach, i.e all factors must be checked to ensure that
each one operates at its optimum level and has sufficient resources to minimize the occurrence of
bottlenecks.
Note: Good DB performance starts with good DB design. No amount of fine-tuning will make a
poorly designed DB perform as well as a well-designed DB.
Performance Tuning: Client and Server
- On the client side, the objective is to generate a SQL query that returns the correct answer in the
least amount of time, using the minimum amount of resources at the server end. The activities
required to achieve that goal are commonly referred to as SQL performance tuning.
- On the server side, the DBMS environment must be properly configured to respond to clients' requests in the fastest
way possible, while making optimum use of existing resources. The activities required to achieve
that goal are commonly referred to as DBMS performance tuning.
DBMS Architecture
It is represented by the processes and structures (in memory and in permanent storage) used to
manage a DB.
[Figure: DBMS architecture diagram]
DBMS Architecture Component and Functions
- All data in the DB are stored in DATA FILES.
A data file can contain rows from one single table, or it can contain rows from many diff tables. The DBA determines the initial size of the data files that make up the DB.
Data files can automatically expand in predefined increments known as EXTENTS. For example, if more space is required, the DBA can define that each new extent will be added in 10KB or 10MB increments.
Data files are generally grouped into file groups or table spaces. A table space or file group is a logical grouping of several data files that store data with similar characteristics.
- The DBMS retrieves the data from permanent storage and places it in RAM (the data cache).
- SQL Cache or Procedure Cache: a shared, reserved memory area that stores the most recently executed SQL statements or PL/SQL procedures, including triggers and functions.
- Data Cache or Buffer Cache: a shared, reserved memory area that stores the most recently accessed data blocks in RAM.
- To move data from permanent storage (data files) to RAM (the data cache), the DBMS issues I/O requests and waits for the replies. An input/output request is a low-level (read or write) data access operation to/from computer devices. The purpose of an I/O operation is to move data to and from diff computer components or devices.
- Working with data in the data cache is many times faster than working with data in data files because the DBMS does not have to wait for the hard disk to retrieve the data.
- The majority of performance-tuning activities focus on minimising the number of I/O operations.
Processes are:
Listener: listens for clients' requests and hands the processing of SQL requests to other DBMS processes.
User: the DBMS creates a user process to manage each client session.
Scheduler: schedules the concurrent execution of SQL requests.
Lock Manager: manages all locks placed on DB objects.
Optimizer: analyses SQL queries and finds the most efficient way to access the data.
Database Statistics: refers to a number of measurements about DB objects, such as tables and indexes, and available resources, such as the number of processors used, processor speed and temporary space available. These statistics give a snapshot of DB characteristics.
Reasons for the DBMS Optimiser
- The DBMS prevents direct access to the DB
- The optimiser is part of the DBMS
- The optimiser processes user requests
- Removes the need for knowledge of the data format, hence data independence
- References the data dictionary
- Therefore increased productivity
- Provides ad-hoc query processing.
Query Processing
DBMS processes queries in 3 phases:
~. Parsing: DBMS parses the SQL query and chooses the most efficient
access/execution plan.
~. Execution: DBMS executes the SQL query using the chosen execution plan.
~. Fetching: DBMS fetches the data and sends the result set back to the client.
The SQL parsing activities are performed by the query optimiser.
The Query Optimiser analyses the SQL query and finds the most efficient way to access the data.
Parsing a SQL query requires several steps:
+ Interpretation:
- Syntax check: validates the SQL statement
- Validation: confirms the existence of objects (tables/attributes)
- Translation: into relational algebra
- Relational algebra optimisation
- Strategy selection: chooses the execution plan
- Code generation: produces executable code
+ Accessing (I/O disk access): read data from the physical data files and generate the result set.
+ Processing time (CPU computation): process the data in the CPU.
Query Optimisation:
Is the central activity during the parsing phase of query processing. In this phase, the DBMS must choose which indexes to use, how to perform join operations, which table to use first and so on.
Indexes facilitate searching, sorting, the use of aggregate functions and even join operations. The improvement in data access speed occurs because an index is an ordered set of values that contains the index key and pointers.
An Optimizer is used to work out how to retrieve the data in the most efficient way from a database.
Types of Optimisers
Heuristic (Rule-based): uses a set of preset rules and points to determine the best approach to
execute a query.
- 15 rules, ranked in order of efficiency; a particular access path for a table is only chosen if the statement contains a predicate or other construct that makes that access path available.
- A score is assigned to each execution strategy using these rankings and the strategy with the best (lowest) score is selected.
The Rule Based (heuristic) optimizer uses a set of rules to quickly choose between alternative options to retrieve the data. It has the advantage of quickly arriving at a solution with a low overhead in terms of processing, but the disadvantage of possibly not arriving at the optimal solution.
Cost Estimation (Cost based): uses sophisticated algorithm based on statistics about the objects
being accessed to determine the best approach to execute a query. The optimiser process adds up
the processing costs, the I/O costs and the resource cost (RAM and temporary space) to come up
with the total cost of a given execution plan.
The Cost Based optimizer uses statistics which the DBA instructs to be gathered from the database tables; based on these values it estimates the expected amount of disk I/O and CPU usage required for alternative solutions. It subsequently chooses the solution with the lowest cost and executes it. It has the advantage of being more likely to arrive at an optimal solution, but the disadvantage of taking more time, with a higher overhead in terms of processing requirements.
Cost-based + hints:
The Cost Based optimizer (with Hints) is the same as the Cost Based optimizer with the additional
facility of allowing the DBA to supply Hints to the optimizer, which instructs it to carry out certain
access methods and therefore eliminates the need for the optimizer to consider a number of
alternative strategies. It has the advantage of giving control to the DBA who may well know what
would be the best access method based on the current database data, plus the ability to quickly
compare alternate execution plans, but it has the disadvantage of taking us back to hard coding
where the instructions on retrieving data are written into the application code. This could lead to the
need to rewrite application code in the future when the situation changes.
Optimiser hints are special instructions for the optimiser that are embedded inside the SQL
command text.
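For instance, in Oracle-style SQL a hint is written as a specially formatted comment immediately after the SELECT keyword; the index name below is hypothetical:
SELECT /*+ INDEX(EMP EMP_DEPTNO_IDX) */ ENAME
FROM   EMP
WHERE  DEPTNO = 10;
-- the hint asks the optimiser to use the EMP_DEPTNO_IDX index
-- rather than letting it choose an access path on its own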
Query Execution Plan (QEP)
SELECT ENAME
FROM   EMP E, DEPT D
WHERE  E.DEPTNO = D.DEPTNO
AND    D.DNAME  = 'RESEARCH';
OPTION 1: JOIN - SELECT - PROJECT
OPTION 2: SELECT - JOIN - PROJECT
[Figure: query execution plans comparing the cost of both strategies]
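Most DBMSs can display the plan the optimiser has chosen. A sketch in Oracle-style SQL (assuming the default PLAN_TABLE is available; output format varies by version) for the query above:
EXPLAIN PLAN FOR
    SELECT ENAME
    FROM   EMP E, DEPT D
    WHERE  E.DEPTNO = D.DEPTNO
    AND    D.DNAME  = 'RESEARCH';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- the displayed plan shows whether the selection or the join is performed first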
Cost-based:
Makes use of statistics in the data dictionary:
- Number of rows
- Number of blocks
- Number of occurrences
- Largest/smallest values
Then calculates the cost of alternative query plans.
Statistics
Cost-based Optimiser: depends on statistics for all tables, clusters and indexes accessed by the query. It is the user's (DBA's) responsibility to generate these statistics and keep them current.
The DBMS_STATS package can be used to generate and manage statistics and histograms. Some DBMSs also provide an update-statistics procedure and Auto-Update and Auto-Create statistics options in their initialization parameters.
Gathering statistics with ANALYZE (a hedged sketch follows the Accessing Statistics list below):
ANALYZE TABLE TA COMPUTE STATISTICS;       -- exact statistics
ANALYZE TABLE TA ESTIMATE STATISTICS;      -- sampled (estimated) statistics, optionally with a sample percentage
ANALYZE INDEX TA_PK ESTIMATE STATISTICS;
Accessing Statistics:
- View USER_TABLES
- View USER_TAB_COLUMNS
- View USER_INDEXES
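A hedged sketch of gathering statistics through the DBMS_STATS package and then inspecting them via the dictionary views listed above (the schema and table names are hypothetical; EXEC is SQL*Plus shorthand for an anonymous PL/SQL block):
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT', tabname => 'TA');
EXEC DBMS_STATS.GATHER_INDEX_STATS(ownname => 'SCOTT', indname => 'TA_PK');

SELECT TABLE_NAME, NUM_ROWS, BLOCKS, LAST_ANALYZED
FROM   USER_TABLES
WHERE  TABLE_NAME = 'TA';
-- NUM_ROWS and BLOCKS are exactly the kinds of measurements the cost-based optimiser relies on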
Review the pros and cons of each type of optimiser.
Review Questions
1. What is SQL performance tuning?
2. What is database performance tuning?
3. What is the focus of most performance-tuning activities and why does that focus exist?
4. What are database statistics, and why are they important?
5. How are DB statistics obtained?
6. What DB statistics measurements are typical of tables, indexes and resources?
7. How is the processing of SQL DDL statements (such as CREATE TABLE) different from the
processing required by DML statements?
8. In simple terms, the DBMS processes queries in three phases. What are those phases and what is accomplished in each phase?
9. If indexes are so important, why not index every column in every table?
10. What is the difference between a rule-based optimizer and a cost-based optimiser?
11. What are optimizer hints, and how are they used?
12. What recommendations would you make for managing the data files in a DBMS with many tables and indexes?
Production System
DB is a carefully designed and constructed repository of facts. The fact repository is a part of a
larger whole known as an Information System.
An Information System provides for data collection, storage and retrieval. It also facilitates the
transformation of data into info. and the mgt of both data and information. Complete information
System is composed of people, hardware, software, the DB, application programs and procedures.
System Analysis is the process that establishes the need for and the scope of an info.system. The
process of creating an info syst. is known as System Development.
A successful database design must reflect the information system of which the database is a part. A successful info system is developed within a framework known as the Systems Development Life Cycle (SDLC). Applications transform data into information that forms the basis for decision making. The most successful DBs are subject to frequent evaluation and revision within a framework known as the DB Life Cycle (DBLC).
Database Design Strategies: Top-down vs Bottom-up and Centralized vs decentralized.
The Information System
Applications
- Transform data into information that forms the basis for decision making
- Usually produce the following: formal reports, tabulations, graphic displays
- Every application is composed of 2 parts:
the data, and the code by which the data are transformed into information.
The performance of an information system depends on a triad of factors:
- DB design and implementation
- Application design and implementation
- Administrative procedures
The term DB Development describes the process of DB design and implementation. The primary objective in DB design is to create complete, normalized, non-redundant and fully integrated conceptual, logical and physical DB models.
System Development Life Cycle (SDLC)
The SDLC is an iterative rather than a Sequential process.
The SDLC is divided into five phases:
Planning: an initial assessment should answer some important questions:
- Should the existing system be continued?
- Should the existing system be modified?
- Should the existing system be replaced?
The feasibility study must address the following:
- The technical aspects of hardware and software requirements
- The system cost.
Analysis: problems defined during the planning phase are examined in greater detail during the analysis phase.
Questions addressed include:
~. What are the requirements of the current system's end users?
~. Do those requirements fit into the overall info requirements?
The analysis phase of the SDLC is, in effect, a thorough audit of user requirements. The existing hardware and software systems are also studied in order to give a better understanding of the system's functional areas, actual and potential problems and opportunities. The analysis phase also includes the creation of a logical system design. The logical design must specify the appropriate conceptual data model, inputs, processes and expected output requirements. When creating the logical design, the designer might use tools such as data flow diagrams (DFDs), hierarchical input process output (HIPO) diagrams and ER diagrams.
Defining the logical system also yields a functional description of the system's components (modules) for each process within the DB envt.
Detailed Systems Design: completes the design of the system's processes. The design includes all necessary technical specifications for the screens, menus, reports and other devices that might be used to help make the system a more efficient information generator.
Implementation: the hardware, DBMS software and application programs are installed and the DB design is implemented. During the initial stages of the implementation phase the system enters a cycle of coding, testing and debugging until it is ready to be delivered. The DB contents may be loaded interactively or in batch mode, using a variety of methods and devices:
- Customised user programs
- DB interface programs
- Conversion programs that import the data from a different file structure, using a batch program, a DB utility or both.
The system is subjected to exhaustive testing until it is ready for use. After testing is concluded, the final documentation is reviewed and printed and end users are trained.
Maintenance: as soon as the system is operational, end users begin to request changes to it. These changes generate system maintenance:
- Corrective maintenance in response to system errors.
- Adaptive maintenance due to changes in the business envt.
- Perfective maintenance to enhance the system.
The DB Life Cycle (DBLC): it contains 6 phases
Database Design Strategies
Two classical approaches to DB design:
Top-down Design:
- Identifies data sets
- Defines the data elements for each of these sets.
This process involves the identification of different entity types and the definition of each entity's attributes.
Bottom-up Design
- Identifies data elements (items)
- Groups them together in data sets
i.e it first defines attributes, then groups them to form entities.
The selection of a primary emphasis on top-down or bottom-up procedures often depends on the
scope of the problem or personal preferences. The 2 methodologies are complementary rather than
mutually exclusive.
Top-down vs Bottom-up Design Sequencing
Even when a primarily top-down approach is selected, the normalization process that revises existing table structures is (inevitably) a bottom-up technique. ER modelling constitutes a top-down process even when the selection of attributes and entities can be described as bottom-up. Because both the ER model and normalization techniques form the basis for most designs, the top-down vs bottom-up debate may be based on a distinction rather than a difference.
Production System (continued)
- Use ESTIMATE to refresh statistics
- Use declarative and procedural integrity
- Use stored PL/SQL procedures (already compiled; held in the shared pool cache)
- System configuration
System Configuration
- Size and configuration of the DB caches
- Number/size of the data (buffer) caches
- Size of the shared pool (SQL, PL/SQL, triggers, data dictionary)
- Log buffers
Options for the DBA
Table structures:
- Heap
- Hash
- ISAM
- BTree
The main difference between the table structures is as follows:
The Heap table has no indexing ability built into it, so if left as a heap it would require a secondary index if it were large and speedy access were required. The others have indexing ability built into them, but the Hash and ISAM structures degrade over time if many modifications are made, the additional data simply being added to the end as a heap in overflow pages. This is as opposed to the B-tree, which is dynamic and grows as data is added.
Data Structures
Heap
No key columns
Queries, other than appends, scan every page
Rows are appended at the end
1 main page, all others are overflow
Duplicate rows are allowed
Do Use when:
Inserting a lot of new rows
Bulk loading data
Table is only 1-2 pages
Queries always scan entire table
Do Not Use when:
You need fast access to 1 or a small subset of rows
Tables are large
You need to make sure a key is unique
Hash
Do Use when:
Data is retrieved based on the exact value of a key
Do Not Use when:
You need pattern matching, range searches
You need to scan the table entirely
You need to use a partial key for retrieval
ISAM
Do Use when:
Queries involve pattern matching and range scans
Table is growing slowly
Key is large
Table is small enough to modify frequently
Do Not Use when:
Only doing exact matches
Table is large and is growing rapidly
Btree
Index is dynamic
Access data sorted by key
Overflow does not occur if there are no duplicate keys
Reuses deleted space on associated data pages
Do Use when
Need pattern matching or range searches on the key
Table is growing fast
Using sequential key applications
Table is too big to modify
Joining entire tables to each other
Do Not Use when:
Table is static or growing slowly
Key is large
Creating Indexes (see the sketch after this list):
Key fields
Foreign keys
Access fields
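A minimal sketch of creating indexes for the three cases above, using hypothetical table and column names:
-- key fields (primary keys are usually indexed automatically via their constraint):
CREATE UNIQUE INDEX CUS_CODE_IDX ON CUSTOMER (CUS_CODE);
-- foreign keys used in joins:
CREATE INDEX INV_CUS_FK_IDX ON INVOICE (CUS_CODE);
-- frequently searched (access) fields:
CREATE INDEX CUS_LNAME_IDX ON CUSTOMER (CUS_LNAME);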
Disk Layout
Multiple Disks
Location of tables/index
Log file
DBMS components
Disk striping
Other factors
CPU
Disk access speed
Operating system
Available memory
swapping to disk
Network performance
De-normalisation
Including children with parents
Storing most recent child with parent
Hard-coding static data
Storing running totals
Use system assigned keys
Combining reference of code tables
Creating extract tables
Centralized vs Decentralized Design
The two general approaches (bottom-up and top-down) to DB design can be influenced by factors such as the scope and size of the system, the company's mgt style and the company's structure (centralised or decentralised).
Centralized Design is productive when the data component is composed of a relatively small number of objects and procedures. Centralised design suits relatively simple and/or small DBs and can be successfully carried out by a single person (the DBA).
The company operations and the scope of the problem are sufficiently limited to allow even a single designer to define the problem, create the conceptual design and verify the conceptual design against the user views.
Decentralized Design: this might be used when the data component of the system has a considerable number of entities and complex relations on which very complex operations are performed. Decentralised design is likely to be employed when the problem itself is spread across several operational sites and each element is a subset of the entire data set.
A carefully selected team of DB designers is employed to tackle a complex DB project. Within the decentralised design framework, the DB design task is divided into several modules. Each design group creates a conceptual data model corresponding to the subset being modelled. Each conceptual model is then verified individually against the user views, processes and constraints for each of the modules. After the verification process has been completed, all modules are integrated into one conceptual model.
Naturally, after the subsets have been aggregated into a larger conceptual model, the lead designer
must verify that the combined conceptual model is still able to support all of the required
transactions.
Database Design
Conceptual, Logical and Physical Database Design.
Conceptual DB Design is where we create the conceptual representation of the DB by producing a data model which identifies the relevant entities and relationships within our system.
Logical DB Design is where we design relations based on each entity and define integrity rules to ensure there are no redundant relationships within our DB.
Physical DB Design is where the physical DB is implemented in the target DBMS. In this stage we
have to consider how each relation is stored and how data is accessed.
Three Stages of DB Design
Selecting a suitable file organisation is important for fast data retrieval and efficient use of storage
space. 3 most common types of file organisation are:
Heap Files: which contain randomly ordered records.
Indexed Sequential Files: which are sorted on one or more fields and accessed using indexes.
Hashed Files: in which a hashing algorithm is used to determine the address of each record
based upon the value of the primary key.
Within a DBMS indexes are often stored in data structure known as B-trees which allow fast data
retrieval. Two other kinds of indexes are Bitmap Indexes and Join Indexes. These are often used on
multi-dimensional data held in data warehouses.
Indexes are crucial in speeding up data access. Indexes facilitate searching, sorting, the use of aggregate functions and even join operations. The improvement in data access speed occurs because an index is an ordered set of values that contains the index key and pointers.
Data Sparsity refers to the number of different values a column could possibly have. Indexes are
recommended in highly sparse columns used in search conditions.
Concurrency and Recovery
A transaction is any action that reads from and/or writes to a DB. A transaction may consist of a
simple SELECT statement to generate a list of table contents. Other statements are UPDATE,
INSERT, or combinations of SELECT, UPDATE & INSERT statement.
A transaction is a logical unit of work that must be entirely completed or entirely aborted; no intermediate states are acceptable. All of the SQL statements in the transaction must be completed successfully. If any of the SQL statements fail, the entire transaction is rolled back to the original DB state that existed before the transaction started.
A successful transaction changes the DB from one consistent state to another. A consistent DB
State is one in which all data integrity constraints are satisfied.
To ensure consistency of the DB, every transaction must begin with the DB in a known consistent
State. If the DB is not in a consistent state, the transaction will yield an inconsistent DB that
violates its integrity and business rules. All transactions are controlled and executed by the DBMS to guarantee DB integrity. Most real-world DB transactions are formed by two or more DB requests; a DB request is the equivalent of a single SQL statement in an application program or transaction.
Terms to know
Transaction: logical unit of work
Consistent State: DB reflecting true position
Concurrent: at the same time.
Sequence: Read disk block, update data, rewrite disk
Serializability: Ensures that concurrent execution of several transactions yields consistent results.
Transaction properties
All transactions must display atomicity, consistency, isolation, durability and serializability (the ACIDS test).
Atomicity: requires that all operations (SQL requests) of a transaction be completed; if not, the transaction is aborted (a minimal SQL sketch follows this list).
Consistency: indicates the permanence of the DB's consistent state; when a transaction is completed, the DB reaches a consistent state.
Isolation: means that the data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
Durability: ensures that once transaction changes are done (committed), they cannot be undone or lost, even in the event of a system failure.
Serializability: ensures that the concurrent execution of several transactions yields consistent results. This is important in multi-user and distributed databases, where multiple transactions are likely to be executed concurrently. Naturally, if only a single transaction is executed, serializability is not an issue.
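A minimal sketch of a transaction as a single logical unit of work (the ACCOUNT table and values are hypothetical); either both updates become permanent or neither does:
-- transfer 100 from account 1001 to account 1002
UPDATE ACCOUNT SET ACC_BALANCE = ACC_BALANCE - 100 WHERE ACC_NUM = '1001';
UPDATE ACCOUNT SET ACC_BALANCE = ACC_BALANCE + 100 WHERE ACC_NUM = '1002';
COMMIT;    -- makes both changes durable
-- if either UPDATE fails, ROLLBACK returns the DB to its prior consistent state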
The Transaction Log
DBMS uses a transaction log to keep track of all transactions that update the DB. The information
stored in this log is used by the DBMS for a recovery requirement triggered by a ROLLBACK
statement.
Log with Deferred Updates
- Transactions are recorded in the log file
- Updates are not written to the DB immediately
- Log entries are used to update the DB
In the event of a failure:
- Any transactions not completed are ignored
- Any transactions committed are redone
- Checkpoints are used to limit the amount of rework.
Log with Immediate Updates
- Writes to the DB as well as to the log file
- The transaction record contains the old and new values
- Once the log record is written, the DB can be updated.
In the event of a failure:
- Transactions not completed are undone using the old values
- Updates are undone in reverse order
- Transactions committed are redone using the new values.
Concurrency Control
The coordination of the simultaneous execution of transactions in a multi-user DB system is known as Concurrency Control. The objective of concurrency control is to ensure the serializability of transactions in a multi-user DB envt. Concurrency control is important because the simultaneous execution of transactions over a shared DB can create several data integrity and consistency problems. Both disk I/O and the CPU are shared by the concurrent transactions.
The 3 main problems are:
Lost Updates
Uncommitted Data
Inconsistent Retrievals.
Uncommitted Data occurs when 2 transactions are executed concurrently and the 1st transaction is
rolled back after the second transaction has already accessed the uncommitted data thus violating
the isolation property of transactions.
Inconsistent Retrievals: occur when a transaction calculates some summary (aggregates) functions
over a set of data while other transactions are updating the data. The problem is that the transaction
might read some data before they are changed and other data after they are changed, thereby
yielding inconsistent results.
Lost Updates: occur when the first transaction T1 has not yet been committed when the second transaction T2 is executed. T2 therefore still operates on the initial value, and T1's update is lost.
The Scheduler:
Is responsible for establishing the order in which the concurrent transaction operations are executed. The transaction execution order is critical and ensures DB integrity in multi-user DB systems. Locking, time-stamping and optimistic methods are used by the scheduler to ensure the serializability of transactions.
Serializability of schedules is guaranteed through the use of two-phase locking. The two-phase locking scheme has a growing phase, in which the transaction acquires all of the locks that it needs without unlocking any data, and a shrinking phase, in which the transaction releases all of the locks without acquiring new locks.
Serializability:
Serial execution means performing transactions one after another
If 2 transactions are only reading a variable, they do not conflict and order is not important.
If 2 transactions operate on separate variables, they do not conflict and order is not important.
Only when a transaction writes to a variable and another either reads or writes the same variable is the order important. Serializability means making sure that, when it counts, transactions operate in order.
Lock Granularity:
It indicates the level of lock use. Locking can take place at the following levels: database, table,
page, row, or even field (attribute)
Database Level Lock: the entire DB is locked, thus preventing the use of any tables in the DB by
transaction T2 while transaction T1 is being executed. This level of locking is good for batch
processes, but it is unsuitable for online multi-user DBMS.
Note that transaction T1 and T2 cannot access the same DB concurrently even when they use diff
tables.
Table Level Lock:
The entire table is locked, preventing access to any row by transaction T2 while transaction T1 is
using the table. If a transaction requires access to several tables, each table may be locked. However
2 transactions can access the same DB as long as they access diff tables.
Page Level Lock
DBMS locks an entire disk page. A disk page or page is the equivalent of a disk block which can be
described as directly addressable section of a disk. A page has a fixed size.
Row Level Lock:
It is much less restrictive than the locks discussed above. DBMS allows concurrent transactions to
access diff rows of the same table, even when the rows are locked on the same page.
Field Level Lock:
It allows concurrent transaction to access the same row as long as they require the use of diff fields
(attributes) within that row.
Lock Types (a sketch of how locks can be requested follows this list):
Shared/Exclusive Locks: an exclusive lock exists when access is reserved specifically for the transaction that locked the object.
Read (shared) lock: allows the reading but not the updating of a data item, allowing multiple concurrent readers.
Write (exclusive) lock: allows exclusive update of a data item.
A shared lock is issued when a transaction wants to read data from the DB and no exclusive lock is held on that data item.
An exclusive lock is issued when a transaction wants to update (write) a data item and no locks are currently held on that data item by any other transaction.
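How these locks are requested varies by DBMS; in Oracle-style SQL, for example, shared and exclusive locks can be requested explicitly (the EMP table is hypothetical):
LOCK TABLE EMP IN SHARE MODE;        -- shared (read) lock: others may read, but not update
LOCK TABLE EMP IN EXCLUSIVE MODE;    -- exclusive (write) lock: reserved for this transaction
SELECT * FROM EMP WHERE DEPTNO = 10 FOR UPDATE;   -- row-level exclusive locks on the selected rows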
Two-Phase Locking:
Defines how transactions acquire and relinquish locks. It guarantees serializability, but it does not prevent deadlocks. The two phases are:
1. Growing Phase: the transaction acquires all required locks without unlocking any data. Once all locks have been acquired, the transaction is at its locked point.
2. Shrinking Phase: the transaction releases all locks and cannot obtain any new lock.
The two-phase locking protocol is governed by the following rules:
2 transactions cannot have conflicting locks.
No unlock operation can precede a lock operation in the same transaction.
No data are affected until all locks are obtained i.e until the transaction is in its locked point.
Deadlocks:
A deadlock occurs when 2 transactions wait for each other to unlock data.
Three Basic Techniques to Control Deadlocks:
+ Deadlock Prevention: a transaction requesting a new lock is aborted when there is the
possibility that a deadlock can occur. If the transaction is aborted, all changes made by this
transaction are rolled back and all locks obtained by the transaction are released. (statically
make deadlock structurally impossible )
+ Deadlock Detection: DBMS periodically tests the DB for deadlocks if a deadlock is found,
one of the transactions (the victim) is aborted (rolled back and restarted) and the other
transaction continues. (let deadlocks occur, detect them and try to recover)
+ Deadlock Avoidance: The transaction must obtain all of the locks it needs before it can be
executed. (avoid deadlocks by allocating resources carefully)
Concurrency Control with Time-Stamping Methods:
Time-stamping: the time-stamping approach to scheduling concurrent transactions assigns a global, unique time stamp to each transaction.
Time stamps must have two properties: uniqueness and monotonicity.
Uniqueness ensures that no equal time stamp values can exist.
Monotonicity ensures that time stamp values always increase.
All DB operations (read and write) within the same transaction must have the same time stamp. The DBMS executes conflicting operations in time stamp order, thereby ensuring serializability of the transactions. If two transactions conflict, one is stopped, rolled back, rescheduled and assigned a new time stamp value. No locks are used, so no deadlock can occur.
A disadvantage of the time-stamping approach is that each value stored in the DB requires two additional time stamp fields.
Concurrency Control with Optimistic Methods:
The optimistic approach is based on the assumption that the majority of DB operations do not conflict. The optimistic approach does not require locking or time-stamping techniques. Instead, a transaction is executed without restrictions until it is committed. Each transaction moves through two or three phases: READ, VALIDATION and WRITE.
- Some envts may have relatively few conflicts between transactions.
- Locking would then be an inefficient overhead.
- The optimistic technique eliminates this overhead.
- Assume there will be no problems.
- Before committing, a check is done.
- If a conflict has occurred, the transaction is rolled back.
Database Recovery:
DB recovery restores a DB from a given state, usually inconsistent, to a previously consistent state.
Need for Recovery:
- Physical disasters: fire, flood
- Sabotage: internal
- Carelessness: unintentional
- Disk malfunctions: head crash, unreadable tracks
- System crashes: hardware
- System software errors: termination of the DBMS
- Application software errors: logical errors.
Recovery Techniques: are based on the atomic transaction property; all portions of the transaction must be treated as a single, logical unit of work in which all operations are applied and completed to produce a consistent DB.
- Techniques restore the DB to a consistent state.
- Transactions not completed are rolled back.
- Transactions are recorded using a log file.
- The log contains transaction and checkpoint records.
- A checkpoint record lists the current transactions.
Four Important Concepts that Affect the Recovery Process:
+ Write-ahead-log protocol: ensures that transaction logs are always written before any DB data are actually updated.
+ Redundant transaction logs: most DBMSs keep several copies of the transaction log to ensure that a physical disk failure will not impair the DBMS's ability to recover data.
+ Database buffers: a buffer is a temporary storage area in primary memory used to speed up disk operations.
+ Database checkpoints: a checkpoint is an operation in which the DBMS writes all of its updated buffers to disk. The checkpoint operation is also registered in the transaction log.
When the recovery procedure uses deferred write (deferred update), the transaction operations do not immediately update the physical DB; instead, only the transaction log is updated. The recovery process for all started and committed transactions (before the failure) follows these steps:
- Identify the last checkpoint in the transaction log.
- For a transaction that started and committed before the last checkpoint, nothing needs to be done because the data are already saved.
- For a transaction that performed a commit operation after the last checkpoint, the DBMS uses the transaction log records to redo the transaction and update the DB, using the "after" values in the transaction log.
- For any transaction that had a ROLLBACK operation after the last checkpoint, or that was left active before the failure occurred, nothing needs to be done because the DB was never updated.
When the recovery procedure uses write-through (immediate update), the DB is immediately updated by transaction operations during the transaction's execution, even before the transaction reaches its commit point.
Deadlocks in Distributed Systems
Deadlocks in distributed systems are similar to deadlocks in single processor systems, only worse
- They are harder to avoid, prevent or even detect.
- They are hard to cure when tracked down because all relevant information is scattered over
many machines.
Distributed Deadlock Detection
Since preventing and avoiding deadlocks is difficult, researchers have worked on detecting the occurrence of deadlocks in distributed systems.
The presence of atomic transactions in some distributed systems makes a major conceptual difference.
When a deadlock is detected in a conventional system, we kill one or more processes to break the deadlock.
When a deadlock is detected in a system based on atomic transactions, it is resolved by aborting one or more transactions. But transactions have been designed to withstand being aborted. When a transaction is aborted, the system is first restored to the state it had before the transaction began, at which point the transaction can start again. With a bit of luck, it will succeed the second time. Thus the difference is that the consequences of killing off a process are much less severe when transactions are used.
1. Centralised Deadlock Detection
We use a centralised deadlock detection algorithm and try to imitate the non-distributed algorithm.
Each machine maintains the resource graph for its own processes and resources.
A centralised coordinator maintains the resource graph for the entire system.
In updating the coordinator's graph, messages have to be passed:
- Method 1: whenever an arc is added to or deleted from the resource graph, a message has to be sent to the coordinator.
- Method 2: periodically, every process sends a list of arcs added and deleted since the previous update.
- Method 3: the coordinator asks for information when it needs it.
One possible way to prevent false deadlocks is to use Lamport's algorithm to provide global timing for the distributed system.
When the coordinator gets a message that leads to a suspected deadlock:
It sends everybody a message saying "I just received a message with timestamp T which leads to deadlock. If anyone has a message for me with an earlier timestamp, please send it immediately."
When every machine has replied, positively or negatively, the coordinator will see whether the deadlock has really occurred or not.
2. The Chandy-Misra-Haas algorithm:
Processes are allowed to request multiple resources at once, so the growing phase of a transaction can be speeded up.
The consequence of this change is that a process may now wait on two or more resources at the same time.
When a process has to wait for some resource, a probe message is generated and sent to the process holding the resource. The message consists of three numbers: the process being blocked, the process sending the message and the process receiving the message.
When the message arrives, the recipient checks to see whether it itself is waiting for any processes. If so, the message is updated, keeping the first number unchanged and replacing the second and third fields with the corresponding process numbers.
The message is then sent to the process holding the needed resources.
If a message goes all the way around and comes back to the original sender, that is, the process that initiated the probe, a cycle exists and the system is deadlocked.
Review Questions
I. Explain the following statement: a transaction is a logical unit of work.
II. What is a consistent database state, and how is it achieved?
III. The DBMS does not guarantee that the semantic meaning of the transaction truly represents
the real-world event. What are the possible consequences of that limitation? Give example.
IV. List and discuss the four transaction properties.
V. What is transaction log, and what is its function?
VI. What is scheduler, what does it do, and why is its activity important to concurrency control?
VII. What is lock and how, in general, does it work?
VIII. What is concurrency control and what is its objectives?
IX. What is an exclusive lock and under what circumstance is it granted?
X. What is a deadlock, and how can it be avoided? Discuss several deadlock avoidance strategies.
XI. What three levels of backup may be used in DB recovery mgt? Briefly describe what each of those three backup levels does.
Database Security Issues
Types of Security
Legal and ethical issues regarding the right to access certain information. Some
information may be deemed to be private and cannot be accessed legally by unauthorized
persons.
Policy Issues at the governmental, institutional or corporate level as to what kinds of info
should not be made publicly available- for example Credit ratings and personal medical
records.
System-related issues: such as the system level at which various security functions should be enforced, for example whether a security function should be handled at the physical hardware level, the operating system level or the DBMS level.
The need to identify multiple security levels and to categorize the data and users based on these classifications, for example top secret, secret, confidential and unclassified. The security policy of the organisation with respect to permitting access to the various classifications of data must be enforced.
Threats to Databases:
These result in the loss or degradation of some or all of the following commonly accepted security goals: integrity, availability and confidentiality.
Loss of Integrity: database integrity refers to the requirement that information be protected from improper modification. Modification of data includes creation, insertion, modification, changing the status of data and deletion. Integrity is lost if unauthorised changes are made to the data by either intentional or accidental acts. If the loss of system or data integrity is not corrected, continued use of the contaminated system or corrupted data could result in inaccuracy, fraud or erroneous decisions.
Loss of Availability: Database availability refers to making objects available to a human
user or a program to which they have a legitimate right.
Loss of Confidentiality: Database confidentiality refers to the protection of data from
unauthorized disclosure. Unauthorized, unanticipated or unintentional disclosure could
result in loss of public confidence, embarrassment or legal action against the organisation.
Control Measures
Four main control measures that are used to provide security of data in databases:
Access control
Inference control
Flow control
Data encryption
Access Control: the security mechanism of a DBMS must include provisions for restricting access
to the database system as a whole. This function is called Access control and is handled by creating
user accounts and passwords to control the login process by the DBMS.
Inference Control: statistical databases are used to provide statistical information or summaries of values based on various criteria, e.g. a database for population statistics. Statistical database users, e.g. government statisticians or market research firms, are allowed to access the database to retrieve statistical information about a population but not to access the detailed confidential information about specific individuals. Statistical database security ensures that information about individuals cannot be accessed. It is sometimes possible to deduce or infer certain facts concerning individuals from queries that involve only summary statistics on groups; consequently, this must not be permitted either. The corresponding control measures are called Inference Control.
Flow Control: it prevents information from flowing in such a way that it reaches unauthorized
users. Channels that are pathways for information to flow implicitly in ways that violate the security
policy of an organisation are called Covert Channels.
Data Encryption: is used to protect sensitive data, such as credit card numbers, that are transmitted via some type of communications network. The data are encoded using a coding algorithm. An unauthorized user who accesses encoded data will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or keys) to decipher the data.
A DBMS typically includes a database security and authorization subsystem that is responsible
for security of portions of a database against unauthorized access.
Two Types of database Security Mechanism:
i) Discretionary Security Mechanisms: these are used to grant privileges to users, including the capability to access specific data files, records or fields in a specified mode (such as read, insert, delete or update).
ii) Mandatory Security Mechanisms: used to enforce multilevel security by classifying the data and users into various security classes/levels and then implementing the appropriate security policy of the organisation, e.g. a typical security policy is to permit users at a certain classification level to see only the data items classified at the user's own or a lower classification level. An extension of this is Role-based Security, which enforces policies and privileges based on the concept of roles.
Database Security and the DBA
The DBA is the central authority for managing a database system. The DBA's responsibilities include granting privileges to users who need to use the system and classifying users and data in accordance with the policy of the organisation. The DBA has a DBA account in the DBMS, sometimes called a System or Superuser Account, which provides powerful capabilities that are not made available to regular database accounts and users. DBA-privileged commands include commands for granting and revoking privileges to individual accounts or user groups and for performing the following types of actions:
i. Account Creation: this action creates a new account and password for a user or group of users to enable access to the DBMS.
ii. Privilege Granting: this action permits the DBA to grant certain privileges to certain accounts.
iii. Privilege Revocation: this action permits the DBA to revoke certain privileges that were previously given to certain accounts.
iv. Security Level Assignment: this action consists of assigning user accounts to the appropriate security classification level.
The DBA is responsible for the overall security of the database system. Action (i) above is used to control access to the DBMS as a whole, whereas actions (ii) and (iii) are used to control discretionary database authorization and action (iv) is used to control mandatory authorization.
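In Oracle-style SQL, for example, account creation and basic privilege granting might look like the following sketch (the user name, password and privileges are hypothetical):
CREATE USER jsmith IDENTIFIED BY StrongPassword1;
GRANT CREATE SESSION TO jsmith;    -- allow the account to log in
GRANT CREATE TABLE   TO jsmith;    -- an account-level (system) privilege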
Access Protection, User Accounts and Database Audits
The DBA will create a new account number and password for the user if there is a legitimate need to access the database. The user must log in to the DBMS by entering the account number and password whenever database access is needed.
It is straightforward to keep track of all database users and their accounts and passwords by creating an encrypted table or file with two fields: account number and password. This table can be easily maintained by the DBMS.
The database system must also keep track of all operations on the database that are applied by a
certain user throughout each login session, which consists of the sequence of database interactions
that a user performs from the time of logging in to the time of logging off.
To keep a record of all updates applied to the database and of the particular users who applied each update, we can expand the system log, which includes an entry for each operation applied to the database that may be required for recovery from a transaction failure or system crash. If any
tampering with the database is suspected, a database audit is performed, which consists of
reviewing the log to examine all access and operations applied to the database during a certain time
period. When an illegal or unauthorized operation is found, the DBA can determine the account
number used to perform the operation. Database audits are particularly important for sensitive
databases that are updated by many transactions and users such as a banking database that is
updated by many bank tellers. A database log that is used mainly for security purposes is sometimes
called an Audit Trail.
Discretionary Access Control based on Granting and Revoking Privileges
The typical method of enforcing discretionary access control in a database system is based on the granting and revoking of privileges.
Types of Discretionary Privileges:
The Account Level: at this level, the DBA specifies the particular privileges that each account
holds independently of the relations in the database.
The privileges at the account level apply to the capabilities provided to the account itself and can include the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relations; the CREATE VIEW privilege; the ALTER privilege, to apply schema changes such as adding or removing attributes from relations; the DROP privilege, to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples; and the SELECT privilege, to retrieve information from the database by using a SELECT query.
The Relation (or Table) Level: at this level, the DBA can control the privileges to access each
individual relation or view in the database.
The second level of privileges applies to the relation level, whether they are base relations or
virtual (view) relations.
The granting and revoking of privileges generally follows an authorization model for discretionary privileges known as the Access Matrix model, where the rows of a matrix M represent subjects (users, accounts, programs) and the columns represent objects (relations, records, columns, views, operations). Each position M(i,j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.
To control the granting and revoking of relation privileges, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was created in the first place. The owner of a relation is given all privileges on that relation. In SQL2, the DBA can assign an owner to a whole schema by creating the schema and associating the appropriate authorization identifier with that schema using the CREATE SCHEMA command. The owner account holder can pass privileges on any of the owned relations to other users by granting privileges to their accounts.
In SQL the following types of privileges can be granted on each individual relation R (see the sketch after this list):
SELECT (retrieval or read) privilege on R: gives the account retrieval privilege. In SQL this gives the account the privilege to use the SELECT statement to retrieve tuples from R.
MODIFY privilege on R: this gives the account the capability to modify tuples of R. In SQL this privilege is divided into UPDATE, DELETE and INSERT privileges to apply the corresponding SQL command to R. Additionally, both the INSERT and UPDATE privileges can specify that only certain attributes of R can be updated by the account.
REFERENCES privilege on R: this gives the account the capability to reference relation R when specifying integrity constraints. This privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have the SELECT privilege on all relations involved in the view definition.
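A hedged sketch of granting the relation-level privileges just described, using a hypothetical EMPLOYEE table and accounts A and B (column lists on UPDATE and REFERENCES are supported by standard SQL, though support varies by DBMS):
GRANT SELECT ON EMPLOYEE TO B;
GRANT INSERT, DELETE ON EMPLOYEE TO B;
GRANT UPDATE (SALARY) ON EMPLOYEE TO B;      -- restrict UPDATE to one attribute
GRANT REFERENCES (EMP_ID) ON EMPLOYEE TO B;  -- allow B to reference EMP_ID in constraints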
Specifying Privileges using Views
The mechanism of views is an important discretionary authorization mechanism in its own right.
For example, if the owner A of a relation R wants another account B to be able to retrieve only
some fields of R, then A can create a view V of R that includes only those attributes and then grant
SELECT on V to B. The same applies to limiting B to retrieving only certain tuples of R; a view V
can be created by defining the view by means of a query that selects only those tuples from R that A
wants to allow B to access.
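A minimal sketch of the scenario just described (the relation, attributes and accounts are hypothetical):
-- A owns EMPLOYEE and wants B to see only the names and departments of dept 5 staff
CREATE VIEW DEPT5_EMPS AS
    SELECT EMP_NAME, DEPT_NO
    FROM   EMPLOYEE
    WHERE  DEPT_NO = 5;

GRANT SELECT ON DEPT5_EMPS TO B;   -- B can query the view but not the base relation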
Revoking Privileges:
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a
relation may want to grant the SELECT privilege to a user for a specific task and then revoke that
privilege once the task is completed. Hence, a mechanism for revoking privileges is needed. In
SQL, a REVOKE command is included for the purpose of canceling privileges.
Propagation of privileges using the GRANT OPTION
Whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B with or without the GRANT OPTION. If the GRANT OPTION is given, this means that B can also grant that privilege on R to other accounts. Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R to a third account C, also with the GRANT OPTION. In this way, privileges on R can propagate to other accounts without the knowledge of the owner of R. If the owner account A now revokes the privilege granted to B, all the privileges that B propagated based on that privilege should automatically be revoked by the system.
It is possible for a user to receive a certain privilege from two or more sources, e.g. A4 may receive a certain UPDATE R privilege from both A2 and A3. In such a case, if A2 revokes this privilege from A4, A4 will still continue to have the privilege by virtue of having been granted it by A3. If A3 later revokes the privilege from A4, A4 totally loses the privilege. Hence a DBMS that allows propagation of privileges must keep track of how all the privileges were granted so that revoking of privileges can be done correctly and completely.
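A hedged sketch of propagation and revocation (accounts and table are hypothetical); note that some DBMSs require an explicit CASCADE clause when revoking a propagated privilege, while in others the revocation cascades automatically:
-- A grants to B, allowing B to pass the privilege on:
GRANT SELECT ON EMPLOYEE TO B WITH GRANT OPTION;
-- B propagates the privilege to C:
GRANT SELECT ON EMPLOYEE TO C;
-- A later revokes it from B; the privileges B propagated are revoked as well:
REVOKE SELECT ON EMPLOYEE FROM B CASCADE;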
Specifying Limits on Propagation of Privileges
Techniques to limit the propagation of privileges have been developed, although they have not yet
been implemented in most DBMSs and are not a part of SQL.
Limiting horizontal propagation to an integer number i means that an account B given the
GRANT OPTION can grant the privilege to at most i other accounts.
Vertical propagation is more complicated; it limits the depth of the granting of privileges.
Granting a privilege with a vertical propagation of zero is equivalent to granting the privilege with
no GRANT OPTION. If account A grants a privilege to account B with the vertical propagation set
to an integer number j>0, this means that the account B has the GRANT OPTION on that privilege,
but B can grant the privilege to other accounts only with a vertical propagation less than j.
Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control techniques of granting and revoking privileges on relations have traditionally been the main security mechanism for relational database systems.
This is an all-or-nothing method: a user either has or does not have a certain privilege.
In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms. It is important to note that most commercial DBMSs currently provide mechanisms only for discretionary access control.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U.
The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O).
Two restrictions are enforced on data access based on the subject/object classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the Simple Security Property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the Star Property (or * property).
The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is higher than the subject's security clearance.
The second restriction is less intuitive. It prohibits a subject from writing an object at a lower security classification than the subject's security clearance.
Violation of this rule would allow information to flow from higher to lower classifications, which violates a basic tenet of multilevel security.
To incorporate multilevel security notions into the relational database model, it is common to
consider attribute values and tuples as data objects. Hence, each attribute A is associated with a
classification attribute C in the schema, and each attribute value in a tuple is associated with a
corresponding security classification. In addition, in some models, a tuple classification attribute
TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a
multilevel relation schema R with n attributes would be represented as
R(A1, C1, A2, C2, ..., An, Cn, TC)
where each Ci represents the classification attribute associated with attribute Ai.
The value of the TC attribute in each tuple t, which is the highest of all attribute classification values within t, provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple.
The apparent key of a multilevel relation is the set of attributes that would have formed the
primary key in a regular (single-level) relation.
A multilevel relation will appear to contain different data to subjects (users) with different clearance
levels. In some cases, it is possible to store a single tuple in the relation at a higher classification
level and produce the corresponding tuples at a lower-level classification through a process known
as Filtering.
In other cases, it is necessary to store two or more tuples at different classification levels with the
same value for the apparent key. This leads to the concept of Polyinstantiation where several
tuples can have the same apparent key value but have different attribute values for users at different
classification levels.
In general, the entity integrity rule for multilevel relations states that all attributes that are
members of the apparent key must not be null and must have the same security classification within
each individual tuple.
In addition, all other attribute values in the tuple must have a security classification greater than or
equal to that of the apparent key. This constraint ensures that a user can see the key if the user is
permitted to see any part of the tuple at all. Other integrity rules, called Null Integrity and
Interinstance Integrity, informally ensure that if a tuple value at some security level can be
filtered from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the
multilevel relation.
Comparing Discretionary Access Control and Mandatory Access Control
Discretionary Access Control (DAC) policies are characterized by a high degree of
flexibility, which makes them suitable for a large variety of application domains.
The main drawback of DAC models is their vulnerability to malicious attacks, such as
Trojan horses embedded in application programs.
By contrast, mandatory policies ensure a high degree of protection; in a way, they prevent any illegal flow of information.
Mandatory policies have the drawback of being too rigid and they are only applicable in
limited environments.
In many practical situations, discretionary policies are preferred because they offer a better
trade-off between security and applicability.
Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for
managing and enforcing security in large-scale enterprise-wide systems. Its basic notion is that
permissions are associated with roles, and users are assigned to appropriate roles. Roles can be
created using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE
commands discussed under DAC can then be used to assign and revoke privileges from roles.
RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls;
it ensures that only authorized users are given access to certain data or resources. A role hierarchy in
RBAC is a natural way to organize roles to reflect the organization's lines of authority and
responsibility. Another important consideration in an RBAC system is the possible temporal
constraints that may exist on roles, such as the time and duration of role activations and the timed
triggering of a role by the activation of another role. The RBAC model is highly desirable for
addressing the key security requirements of Web-based applications.
RBAC models have several desirable features, such as flexibility, policy neutrality, and better support
for security management and administration, among other aspects that make them attractive candidates
for developing secure Web-based applications. RBAC models can represent traditional DAC and
MAC policies as well as user-defined or organization-specific policies.
The RBAC model provides a natural mechanism for addressing the security issues related to the
execution of tasks and workflows. Easier deployment over the Internet has been another reason for
the success of RBAC models.
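As a concrete sketch of role creation and privilege assignment, the following uses the CREATE ROLE, GRANT and REVOKE statements mentioned above through a generic Python DB-API connection. The connection object conn, the role names and the table names are assumptions for illustration, and the exact role syntax varies between DBMS products.

    # Sketch of role-based access control via SQL role commands
    # (assumes an Oracle/PostgreSQL-style dialect and an open DB-API connection `conn`).
    def set_up_roles(conn):
        cur = conn.cursor()
        # Create roles that mirror the organization's lines of responsibility.
        cur.execute("CREATE ROLE payroll_clerk")
        cur.execute("CREATE ROLE payroll_manager")
        # Attach permissions to roles, not to individual users.
        cur.execute("GRANT SELECT ON employee TO payroll_clerk")
        cur.execute("GRANT SELECT, UPDATE ON employee TO payroll_manager")
        # Assign users to the appropriate roles.
        cur.execute("GRANT payroll_clerk TO alice")
        cur.execute("GRANT payroll_manager TO bob")
        # Revoking a role later removes all of its privileges from the user at once.
        cur.execute("REVOKE payroll_clerk FROM alice")
        conn.commit()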
Access Control Policies for E-commerce and the Web
E-Commerce environments require elaborate policies that go beyond traditional DBMSs.
In conventional database environments, access control is usually performed using a set of
authorizations stated by security officers or users according to some security policies. Such
a simple paradigm is not well suited for a dynamic environment like e-commerce.
In an e-commerce environment the resources to be protected are not only traditional
data but also knowledge and experience. Such peculiarities call for more flexibility
in specifying access control policies.
The access control mechanism should be flexible enough to support a wide spectrum
of heterogeneous protection objects.
A second related requirement is the support for content-based access-control. Content-based
access control allows one to express access control policies that take the protection object
content into account. In order to support content-based access control, access control
policies must allow inclusion of conditions based on the object content.
A third requirement is related to the heterogeneity of subjects, which requires access control
policies based on user characteristics and qualifications rather than on specific, individual
characteristics such as user IDs.
A credential is a set of properties concerning a user that are relevant for security purposes.
It is believed that the XML language can play a key role in access control for e-commerce
applications because XML is becoming the common representation language for document
interchange over the Web and is also becoming the language of e-commerce.
Statistical Database Security
Statistical databases are used mainly to produce statistics on various populations.
The database may contain confidential data on individuals, which should be protected from
user access.
Users are permitted to retrieve statistical information on the populations, such as averages,
sums, counts, maximums, minimums, and standard deviations.
A population is a set of tuples of a relation (table) that satisfy some selection condition.
Statistical queries involve applying statistical functions to a population of tuples.
For example, we may want to retrieve the number of individuals in a population or the
average income in the population. However, statistical users are not allowed to retrieve
individual data, such as the income of a specific person. Statistical database security
techniques must prohibit the retrieval of individual data.
This can be achieved by prohibiting queries that retrieve attribute values and by allowing
only queries that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX,
AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called Statistical
Queries.
It is the DBMS's responsibility to ensure the confidentiality of information about individuals, while
still providing useful statistical summaries of data about those individuals to users.
Provision of privacy protection for the individuals represented in a statistical database is paramount.
In some cases it is possible to infer the values of individual tuples from a sequence of statistical
queries. This is particularly true when the conditions result in a population consisting of a
small number of tuples.
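One common, although not foolproof, safeguard against the inference problem just described is to reject any statistical query whose selection condition identifies too small a population. The sketch below is illustrative only: the census table, the threshold of 10 tuples and the helper run_statistical_query are assumptions, and real systems combine such a check with further controls (query-overlap restrictions, noise addition, and so on).

    import sqlite3

    MIN_POPULATION = 10  # reject statistics over fewer than 10 individuals (assumed threshold)

    def run_statistical_query(conn, aggregate_sql, where_clause, params=()):
        """Run an aggregate query only if the selected population is large enough.
        The aggregate and WHERE clause are assumed to come from a trusted query parser."""
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM census WHERE {where_clause}", params)
        population = cur.fetchone()[0]
        if population < MIN_POPULATION:
            raise PermissionError("Population too small: individual values could be inferred")
        cur.execute(f"SELECT {aggregate_sql} FROM census WHERE {where_clause}", params)
        return cur.fetchone()[0]

    # Example use: the average income of a broad population is allowed,
    # but a condition that isolates a single person would be rejected.
    # avg_income = run_statistical_query(conn, "AVG(income)", "city = ?", ("London",))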
Flow Control
Flow control regulates the distribution or flow of information among accessible objects. A
flow between object X and object Y occurs when a program reads values from X and writes
values into Y.
Flow controls check that information contained in some objects does not flow explicitly or
implicitly into less protected objects.
A flow policy specifies the channels along which information is allowed to move. The
simplest flow policy specifies just two classes of information: confidential (C) and
nonconfidential (N), and allows all flows except those from class C to class N. This policy
can solve the confinement problem that arises when a service program handles data such as
customer information, some of which may be confidential.
Flow controls can be enforced by an extended access control mechanism, which involves
assigning a security class (usually called the clearance) to each running program.
The flow control mechanism must verify that only authorized flows, both explicit and implicit,
are executed. A set of rules must be satisfied to ensure secure information flow.
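A minimal sketch of the two-class flow policy described above follows. The classes C and N are taken from the text; the function name and the way programs are represented are assumptions. A flow is permitted unless it would move confidential information into a nonconfidential object.

    # Two-class flow policy: confidential (C) and nonconfidential (N).
    # The only forbidden flow is from class C into class N.
    def flow_allowed(source_class: str, target_class: str) -> bool:
        return not (source_class == "C" and target_class == "N")

    assert flow_allowed("N", "C")       # upgrading information is fine
    assert flow_allowed("C", "C")       # confidential to confidential is fine
    assert not flow_allowed("C", "N")   # would leak confidential data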
Covert Channels
A covert channel allows a transfer of information that violates the security policy. It allows
information to pass from a higher classification level to a lower classification level through
improper means.
Covert channels can be classified into two broad categories: timing channels and storage channels.
In a Timing Channel, information is conveyed by the timing of events or processes, whereas
Storage Channels do not require any temporal synchronization; instead, information is conveyed by
accessing system information that is otherwise inaccessible to the user.
Encryption and Public key Infrastructures
Encryption is a means of maintaining secure data in an insecure environment. Encryption consists
of applying an encryption algorithm to data using some prespecified encryption key. The resulting
data has to be decrypted using a decryption key to recover the original data.
The Data and Advanced Encryption Standards
The Data Encryption Standard (DES) is a system developed by the U.S. government for use by the
general public. It has been widely accepted as a cryptographic standard both in the United States and
abroad. DES can provide end-to-end encryption on the channel between sender A and receiver B.
The DES algorithm is a careful and complex combination of two of the fundamental building
blocks of encryption: substitution and permutation (transposition)
Public Key Encryption
The two keys used for public key encryption are referred to as the public key and the private key.
Invariably, the private key is kept secret, but it is referred to as a private key rather than a secret
key (the key used in conventional encryption) to avoid confusion with conventional encryption.
Public key encryption refers to a type of cipher architecture known as public key cryptography that
utilizes two keys (a key pair) to encrypt and decrypt data. One of the two keys is a public key,
which anyone can use to encrypt a message for the owner of that key. The encrypted message is
sent and the recipient uses his or her private key to decrypt it. This is the basis of public key
encryption.
Other encryption technologies that use a single shared key to both encrypt and decrypt data rely on
both parties deciding on a key ahead of time without other parties finding out what that key is. This
type of encryption technology is called symmetric encryption, while public key encryption is
known as asymmetric encryption.
The public key of the pair is made public for others to use, whereas the private key is known only to
its owner.
A public key encryption scheme or infrastructure has six ingredients:
i. Plaintext: the data or readable message that is fed into the algorithm as input.
ii. Encryption algorithm: this algorithm performs various transformations on the plaintext.
iii. Public key and iv. Private key: a pair of keys selected so that if one is used for
encryption, the other is used for decryption.
v. Ciphertext: the scrambled message produced as output. It depends on the plaintext and
the key. For a given message, two different keys will produce two different ciphertexts.
vi. Decryption algorithm: this algorithm accepts the ciphertext and the matching key and
produces the original plaintext.
A "key" is simply a small bit of text code that triggers the associated algorithm to encode or
decode text. In public key encryption, a key pair is generated using an encryption program and the
pair is associated with a name or email address. The public key can then be made public by posting
it to a key server, a computer that hosts a database of public keys.
Public key encryption can also be used for secure storage of data files. In this case, your public key
is used to encrypt files while your private key decrypts them.
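The public/private key roles described above can be sketched in a few lines using the third-party Python cryptography package (an assumption: the package must be installed; the key size and padding choices are illustrative, not prescriptive).

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # Generate a key pair; the private key stays with its owner,
    # while the public key can be published (e.g. on a key server).
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Anyone holding the public key can encrypt a message for the key owner...
    ciphertext = public_key.encrypt(b"confidential report", oaep)

    # ...but only the owner of the matching private key can decrypt it.
    plaintext = private_key.decrypt(ciphertext, oaep)
    assert plaintext == b"confidential report"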
User Authentication: is a way of identifying the user and verifying that the user is allowed to
access some restricted data or application. This can be achieved through the use of passwords and
access rights.
Methods of attacking a distributed system
- Eavesdropping: is the act of surreptitiously listening to a private conversation.
- Masquerading
- Message tampering
- Replaying
- Denial of service: A denial-of-service attack (DoS attack) or distributed denial-of-
service attack (DDoS attack) is an attempt to make a computer resource unavailable to its
intended users. Although the means to carry out, motives for, and targets of a DoS attack
may vary, it generally consists of the concerted efforts of a person or persons to prevent an
Internet site or service from functioning efficiently or at all, temporarily or indefinitely.
Perpetrators of DoS attacks typically target sites or services hosted on high-profile web
servers such as banks, credit card payment gateways, and even root nameservers
- Phishing: Phishers use electronic communications that look as if they came from
legitimate banks or other companies to persuade people to divulge sensitive information,
including passwords and credit card numbers.
Why Cryptography is necessary in a Distributed System
Supporting the facilities of a distributed system, such as resource distribution, requires the use of an underlying message
passing system. Such systems are, in turn, reliant on the use of a physical transmission network, upon which the
messages may physically be communicated between hosts.
Physical networks and, therefore, the basic message passing systems built over them are vulnerable to attack. For
example, hosts may easily attach to the network and listen in on the messages (or 'conversations') being held. If the
transmissions are in a readily understandable form, the eavesdroppers may be able to pick out units of information, in
effect stealing their information content.
Aside from the theft of user data, which may be in itself of great value, there may also be system information being
passed around as messages. Eavesdroppers from both inside and outside the system may attempt to steal this system
information as a means of either breaching internal access constraints, or to aid in the attack of other parts of the
system. Two possibly worse scenarios may exist where the attacking system may modify or insert fake transmissions on
the network. Accepting faked or modified messages as valid could lead a system into chaos.
Without adequate protection techniques, Distributed Systems are extremely vulnerable to the standard types of attack
outlined above. The encryption techniques discussed in the remainder of this section aim to provide the missing
protection by transforming a message into a form where if it were intercepted in transit, the contents of the original
message could not be explicitly discovered. Such encrypted messages, when they reach their intended recipients,
however, are capable of being transformed back into the original message.
There are two main frameworks in which this goal may be achieved: Secret Key Encryption Systems
and Public Key Encryption Systems.
Secret Key Encryption Systems
Secret key encryption uses a single key to both encrypt and decrypt messages. As such it must be present at both the
source and destination of transmission to allow the message to be transmitted securely and recovered upon receipt at the
correct destination. The key must be kept secret by all parties involved in the communication. If the key fell into the
hands of an attacker, they would then be able to intercept and decrypt messages, thus thwarting the attempt to attain
secure communications by this method of encryption.
Secret key algorithms like DES assert that even though it is theoretically possible to derive the secret key from the
encrypted message alone, the quantities of computation involved in doing so make any attempts infeasible with current
computing hardware. The Kerberos architecture is a system based on the use of secret key encryption.
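The single-shared-key idea can be sketched with the symmetric Fernet construction from the same Python cryptography package (again assuming the package is available; Fernet is used here only as a convenient stand-in for a DES-style secret key cipher, not as the algorithm the text describes).

    from cryptography.fernet import Fernet

    # Both sender and receiver must hold this one key and keep it secret.
    shared_key = Fernet.generate_key()

    sender = Fernet(shared_key)
    token = sender.encrypt(b"transfer 500 to account 42")   # what travels on the network

    receiver = Fernet(shared_key)
    message = receiver.decrypt(token)                        # the same key recovers the message
    assert message == b"transfer 500 to account 42"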
Public Key Encryption
Public key systems use a pair of keys, each of which can decrypt the messages encrypted by the other. Provided one of
these keys is kept secret (the private key), any communication encrypted using the corresponding public key can be
considered secure as the only person able to decrypt it holds the corresponding private key.
The algorithmic properties of the encryption and decryption processes make it infeasible to derive a private key from a
public key, an encrypted message, or a combination of both. RSA is an example of a public key algorithm for
encryption and decryption. It can be used within a protocol framework to ensure that communication is secure and
authentic.
Data Privacy through Encryption
There are two aspects to determining the level of privacy that can be attained through the Kerberos and RSA systems.
To begin with, there is an analysis of the security of the two systems from an algorithmic view. The questions raised at
this stage aim to consider exactly how hard it is to derive a private or secret key from encrypted text or public keys.
One of the long-established secret key algorithms is DES, although two more recent algorithms, RC2 and RC4,
have also arisen. The size (i.e., length) of the keys employed is considered a useful metric when assessing the
strength of a cryptosystem, because longer keys generally make encrypted text more difficult to
decrypt without the appropriate key.
The DES algorithm has a fixed key length of 56 bits. Keys of this size are no longer considered strong enough to
withstand brute-force attacks with modern hardware, which is one reason DES has been superseded by ciphers that
support longer keys, such as AES. The algorithm's fixed key size is a fundamental constraint as hardware and
theoretical advances are made. The RC2 and RC4 algorithms also have bounded maximum key sizes that limit their
usefulness similarly.
A major problem associated with secret key systems, however, is their need for a secure channel within which keys can
be propagated. In Kerberos, every client needs to be made aware of its secret key before it can begin communication.
To do so without giving away the key to any eavesdroppers requires a secure channel. In practice, maintaining a
channel that is completely secure is very difficult and often impractical.
A second aspect to privacy concerns how much inferential information can be obtained through the system: for
example, how much information it is possible to deduce without explicitly decrypting actual messages. One particularly
disastrous situation would be if it were possible to derive the secret or private keys without mounting attacks on public
keys or encrypted messages.
In Kerberos, there is a danger that an attacker can watch a client progress through the authentication protocol.
Such information may be enough to mount an attack on the client by jamming the network at strategic points in the
protocol. Denial of service like this may be very serious in a time critical system.
In pure algorithmic terms, RSA is strong. It can support much longer key lengths than DES and similar ciphers. Key
length is limited only by technology, so the algorithm can keep step with advances in hardware and become
stronger by supporting ever longer keys.
Unlike secret key systems, the private keys of any public key system need never be transmitted. Provided local security
is strong, the overall strength of the algorithm gains from the fact that the private key never leaves the client.
RSA is susceptible to information leakage, however, and some recent theoretic work outlined an attack plan that could
infer the private key of a client based on some leaked, incidental information. Overall however, the RSA authentication
protocol is not as verbose as the Kerberos equivalent. Having fewer interaction stages limits the bandwidth of any
channel through which information may escape. A verbose protocol like Kerberos's simply gives an eavesdropper more
opportunity to listen and possibly defines a larger and more identifiable pattern of interaction to listen for.
Distributed systems require the ability to communicate securely with other computers in the
network. To accomplish this, most systems use key management schemes that require prior
knowledge of public keys associated with critical nodes. In large, dynamic, anonymous systems,
this key sharing method is not viable. Scribe is a method for efficient key management inside a
distributed system that uses Identity Based Encryption (IBE). Public resources in a network are
addressable by unique identifiers. Using this identifier as a public key, other entities are able to
securely access that resource. Evaluations of key distribution schemes inside Scribe have
provided recommendations for practical implementation to allow for secure, efficient, authenticated
communication inside a distributed system.
Parallel and Distributed Databases
In parallel system architecture, there are two main types of multiprocessor system architectures that
are commonplace:
Shared memory (tightly coupled) architecture. Multiple processors share secondary
(disk) storage and also share primary memory.
Shared disk (loosely coupled) architecture. Multiple processors share secondary (disk)
storage but each has its own primary memory.
These architectures enable processors to communicate without the overhead of exchanging
messages over a network. Database mgt systems developed using the above types of architectures are
termed Parallel Database Mgt Systems rather than DDBMSs, since they utilize parallel processor
technology. Another type of multiprocessor architecture is called the shared-nothing architecture. In
this architecture, every processor has its own primary and secondary (disk) memory, no common
memory exists and the processors communicate over a high-speed interconnection network.
BENEFITS OF A PARALLEL DBMS
Improves response time and throughput.
Interquery parallelism: it is possible to process a number of transactions in parallel with each other.
Intraquery parallelism: it is possible to process the sub-tasks of a single transaction in parallel with each other.
How to Measure the Benefits
Speed-Up
As you multiply resources by a certain factor, the time taken to execute a transaction should be
reduced by the same factor:
10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
Scale-up.
As you multiply resources the size of a task that can be executed in a given time should be
increased by the same factor.
1 second to scan a DB of 1,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
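Both metrics can be written as simple ratios. The sketch below just restates the figures from the two examples above; the function names are for illustration only.

    def speed_up(time_one_cpu, time_n_cpus):
        """Ideal linear speed-up equals the factor by which resources were multiplied."""
        return time_one_cpu / time_n_cpus

    def scale_up(task_size_n_cpus, task_size_one_cpu):
        """Ideal linear scale-up: n times the resources handle n times the task in the same time."""
        return task_size_n_cpus / task_size_one_cpu

    print(speed_up(10, 1))          # 10.0 -> 10 CPUs scan the same 10,000 records ten times faster
    print(scale_up(10_000, 1_000))  # 10.0 -> 10 CPUs scan ten times the records in the same 1 second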
Characteristics of Parallel DBMSs
CPUs will be co-located.
Same machine or same building: tightly coupled.
Biggest problem:
Interference: contention for memory access and bandwidth.
Shared Architectures only!
The Evolution of Distributed Database Management Systems
Distributed database management system (DDBMS) governs storage and processing of logically
related data over interconnected computer systems in which both data and processing functions are
distributed among several sites. To understand how and why the DDBMS is different from the DBMS,
it is useful to briefly examine the changes in the database environment that set the stage for the
development of the DDBMS. Corporations implemented centralized database mgt systems to meet
their structured information needs, and such structured information needs are well served by centralized
systems.
Basically, the use of a centralized database required that corporate data be stored at a single central
site, usually on a mainframe or midrange computer. Database mgt systems based on the relational
model could provide the environment in which unstructured information needs would be met by
employing ad hoc queries. End users would be given the ability to access data when needed.
Social and technological changes that affected DB development and design:
Business operations became more decentralized geographically.
Competition increased at the global level.
Customer demands and market needs favoured a decentralised mgt style.
Rapid technological change created low-cost microcomputers with mainframe-like power.
The large number of applications based on DBMSs and the need to protect investments in
centralised DBMS software made the notion of data sharing attractive.
Those factors created a dynamic business envt in which companies had to respond quickly to
competitive and technological pressures. Two database requirements became obvious:
Rapid ad hoc data access became crucial in the quick-response decision-making envt.
The decentralisation of mgt structures, based on the decentralisation of business units, made
decentralised multiple-access and multiple-location databases a necessity.
However, the way those factors were addressed was strongly influenced by:
~. The growing acceptance of the internet
~. The increased focus on data analysis that led to data mining and data warehousing.
The decentralized DB is especially desirable because centralised DB mgt is subject to problems
such as:
Performance degradation due to a growing number of remote locations over greater
distances.
High costs associated with maintaining and operating large central database systems.
Reliability problems created by dependence on a central site.
The dynamic business environment and the shortcomings of centralized databases spawned a demand for
applications based on data access from different sources at multiple locations. Such a multiple-
source/multiple-location database envt is managed by a distributed DB mgt system (DDBMS).
DDBMS Advantages and Disadvantages
Advantages:
I. Data are located near the greatest demand site. The data in a distributed DB system are
dispersed to match business requirements.
II. Faster data access: end users often work with only a locally stored subset of the company's
data.
III. Faster data processing: spreads out the system's workload by processing data at several
sites.
IV. Growth facilitation: new site can be added to the network without affecting the operations
of other sites.
V. Improved communications: local sites are smaller and located closer to customers, local
sites foster better communication among departments and between customers and company
staff.
VI. Reduced operating costs: development work is done more cheaply and more quickly on
low-cost PCs than on mainframes.
VII. User-friendly interface: the GUI simplifies use and training for end users.
VIII. Less danger of a single-point failure.
IX. Processor independence: the end user is able to access any available copy of the data, and
an end user's request is processed by any processor at the data location.
Disadvantages:
1. Complexity of mgt and control: applications must recognise data location, and they must
be able to stitch together data from different sites. The DBA must have the ability to coordinate DB
activities to prevent DB degradation due to data anomalies.
2. Security: the probability of security lapses increases when data are located at multiple sites.
3. Lack of standards: there are no standard communication protocols at the DB level.
4. Increased storage requirements: multiple copies of data are required at different sites, thus
requiring additional disk storage space.
5. Increased training cost: training costs are generally higher in a distributed model than they would be in a
centralised model.
Distributed Processing and Distributed Databases
In distributed processing, a DB's logical processing is shared among two or more physically
independent sites that are connected through a network.
A distributed database, on the other hand, stores a logically related DB over two or more
physically independent sites. In contrast, a distributed processing system uses only a single-site
DB but shares the processing chores among several sites. In a distributed database system, a DB is
composed of several parts known as Database Fragments. An example of a distributed DB envt is
described below.
The DB is divided into three database fragments (E1, E2 and E3) located at different sites. The
computers are connected through a network system. The users Alan, Betty and Hernando do not need
to know the name or location of each fragment in order to access the DB. When comparing
distributed processing with a distributed database, keep in mind that:
Distributed processing does not require a distributed DB, but a distributed DB requires
distributed processing.
Distributed processing may be based on a single DB located on a single computer.
Both distributed processing and distributed DBs require a network to connect all
components.
Characteristics of DDBMS
i. Application interface to interact with the end user.
ii. Validation to analyse data requests.
iii. Transformation to determine which data request components are distributed
and which are local.
iv. Query optimisation to find the best access strategy.
v. Mapping to determine the data location of local and remote fragments.
vi. I/O interface to read or write data from or to permanent local storage.
vii. Formatting to prepare the data for presentation to the end user
viii. Security to provide data privacy at both local and remote DBs
ix. Concurrency control to manage simultaneous data access and to ensure data
consistency across DB fragments in the DBMS.
DDBMS Components
+ Computer workstations (sites or nodes) that form the network system.
+ Network hardware and software components that reside in each workstation.
+ Communications media that carry the data from one workstation to another.
+ Transaction processor (TP), which is the software component found in each computer that
requests data. The TP receives and processes the application's data requests. The TP is also
known as the Application Processor (AP) or the Transaction Manager (TM).
+ The Data Processor (DP) which is a software component residing on each computer that
stores and retrieves data located at the site. Also known as Data Manager (DM). A data
processor may even be a centralised DBMS.
Levels of Data and Process Distribution.
Current DB systems can be classified on the basis of how process distribution and data distribution
are supported. For example, a DBMS may store data in a single site (centralised DB) or in multiple
sites (Distributed DB) and may support data processing at a single site or at multiple sites. The table
below uses a simple matrix to classify DB systems according to data and process distribution.
Database System: levels of data and process distribution
                      | Single-site Data                           | Multiple-site Data
Single-site process   | Host DBMS (mainframe)                      | Not applicable (requires multiple processes)
Multiple-site process | File server; client/server DBMS (LAN DBMS) | Fully distributed client/server DDBMS
Single-site Processing, Single-site Data (SPSD): all processing is done on a single CPU or host
computer and all data are stored on the host computer's local disk. Processing cannot be done on the
end user's side of the system. The functions of the TP and the DP are embedded within the DBMS
located on a single computer. All data storage and data processing are handled by a single CPU.
Multiple-site Processing, Single-site Data (MPSD):
Multiple processes run on different computers sharing a single data repository. The MPSD scenario requires
a network file server running conventional applications that are accessed through a LAN.
Note
- The TP on each workstation acts only as a redirector to route all network data requests to
the file server.
- The end user sees the file server as just another hard disk.
- The end user must make a direct reference to the file server in order to access remote
data. All record- and file-locking activity is done at the end-user location.
- All data selection, search and update functions take place at the workstation.
Multiple-site Processing, Multiple-site Data (MPMD):
This describes a fully distributed database management system with support for multiple data
processors and transaction processors at multiple sites. Classified as either homogeneous or
heterogeneous
Homogeneous DDBMSs: integrate only one type of centralized DBMS over a network.
Heterogeneous DDBMSs: integrate different types of centralized DBMSs over a network.
Fully heterogeneous DDBMSs: support different DBMSs that may even support different data
models (relational, hierarchical, or network) running under different computer systems, such as
mainframes and microcomputers.
Distributed Database Transparency Features:
These features share the common property of allowing the end user to feel like the DB's only user. The user
believes that (s)he is working with a centralised DBMS; all complexities of a distributed DB are
hidden or transparent to the user. The features are:
~. Distribution Transparency: which allows a distributed DB to be treated as a single
logical DB.
~. Transaction Transparency: which allows a transaction to update data at several
network sites.
~. Failure Transparency: which ensures that the system will continue to operate in the
event of a node failure.
~. Performance Transparency: which allows the system to perform as if it were a
centralised DBMS. It also ensures that system will find the most cost-effective path to
access remote data.
~. Heterogeneity Transparency: which allows the integration of several diff local
DBMSs under a common or global schema.
Distribution Transparency: Allows management of a physically dispersed database as though it
were a centralized database. Three levels of distribution transparency are recognized:
Fragmentation transparency: the highest level of transparency. The end user or programmer
does not need to know that the DB is partitioned.
Location transparency: exists when the end user or programmer must specify the DB
fragment names but does not need to specify where those fragments are located.
Local mapping transparency: exists when the end user or programmer must specify both the
fragment names and their locations.
Transaction Transparency: Ensures database transactions will maintain the distributed database's
integrity and consistency. Transaction transparency ensures that a transaction is completed only
when all DB sites involved in the transaction complete their part of the transaction.
Distributed Requests and Distributed Transactions
Remote request: lets a single SQL statement access data to be processed by a single remote database
processor.
Remote transaction: accesses data at a single remote site; it may contain several requests, but all of
them reference the same remote DP site.
Distributed transaction: allows a transaction to reference several different (local or remote) DP
sites; it can update or request data from several different remote sites on the network.
Distributed request: lets a single SQL statement reference data located at several different local or
remote DP sites. Because each request (SQL statement) can access data from more than one local or
remote DP site, a transaction can access several sites.
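The four cases differ only in how many sites a statement or a transaction may touch. The hedged sketch below contrasts them with hypothetical CUSTOMER and INVOICE fragments stored at sites A and B; all table, column and site names are assumptions for illustration.

    # Hypothetical fragments: CUSTOMER stored at site A, INVOICE stored at site B.

    # Remote request: one SQL statement, one remote DP site (site A only).
    remote_request = "SELECT * FROM customer WHERE cus_state = 'FL'"

    # Remote transaction: several statements, but all against the same remote site (A).
    remote_transaction = [
        "UPDATE customer SET cus_balance = cus_balance + 100 WHERE cus_num = 10",
        "UPDATE customer SET cus_state = 'TN' WHERE cus_num = 11",
    ]

    # Distributed transaction: each statement hits one site, but the transaction spans A and B.
    distributed_transaction = [
        "UPDATE customer SET cus_balance = cus_balance - 100 WHERE cus_num = 10",  # site A
        "INSERT INTO invoice VALUES (1001, 10, 100)",                              # site B
    ]

    # Distributed request: a single statement that references data at both sites,
    # so the DDBMS must decompose it and join the partial results.
    distributed_request = """
        SELECT c.cus_name, SUM(i.inv_total)
        FROM customer AS c JOIN invoice AS i ON c.cus_num = i.cus_num
        GROUP BY c.cus_name
    """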
Distributed Concurrency Control
Concurrency control becomes especially important in the distributed envt because multisite,
multiple-process operations are much more likely to create data inconsistencies and deadlocked
transactions than are single-site systems. For example, the TP component of a DBMS must ensure
that all parts of the transaction are completed at all sites before a final COMMIT is used to record
the transaction.
Performance Transparency and Query Optimization
Because all data reside at a single site in a centralised DB, the DBMS must evaluate every data
request and find the most efficient way to access the local data. In contrast, the DDBMS makes it
possible to partition a DB into several fragments, thereby rendering the query translation more
complicated because the DDBMS must decide which fragment of the DB to access. The objective
of a query optimization routine is to minimise the total cost associated with the execution of a
request.
One of the most important characteristics of query optimisation in distributed DB system is that it
must provide distribution transparency as well as Replica transparency. Replica Transparency
refers to the DDBMS's ability to hide the existence of multiple copies of data from the user.
Operation modes can be classified as manual or automatic. Automatic query optimisation
means that the DDBMS finds the most cost-effective access path without user intervention. Manual
query optimisation requires that the optimisation be selected and scheduled by the end user or
programmer.
Query optimisation algorithms can also be classified as:
Static query optimisation: takes place at the compilation time. It creates the plan necessary to
access the DB. When the program is executed, the DBMS uses that plan to access the DB.
Dynamic query optimisation: takes place at execution time. The DB access strategy is defined when
the program is executed. It is efficient, but its cost is the run-time processing overhead. The
best strategy is determined every time the query is executed, and this could happen several times in the
same program.
Distributed Database Design
Data fragmentation deals with how to partition the database into fragments.
Data replication deals with which fragments to replicate.
Data allocation deals with where to locate those fragments and replicas.
Data Fragmentation
Breaks single object into two or more segments or fragments
Each fragment can be stored at any site over computer network
Information about data fragmentation is stored in distributed data catalog (DDC), from which it is
accessed by TP to process user requests.
Three Types of data fragmentation Strategies
Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows).
Each fragment is stored at a different node, and each fragment holds unique rows. However, the
unique rows all have the same attributes (columns). In short, each fragment represents the
equivalent of a SELECT statement, with the WHERE clause on a single attribute.
Vertical fragmentation: Division of a relation into attribute (column) subsets. Each subset
(fragment) is stored at a different node, and each fragment has unique columns with the
exception of the key column, which is common to all fragments. This is the equivalent of a
PROJECT statement.
Mixed fragmentation: Combination of horizontal and vertical strategies. In other words, a
table may be divided into several horizontal subsets (rows), each one having a subset of the
attributes (columns). A sketch of the first two strategies follows this list.
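The sketch referenced above uses an in-memory SQLite database with an assumed CUSTOMER table to show horizontal fragments built with SELECT/WHERE and vertical fragments built by projection on the key plus some columns; in a real DDBMS the fragments would, of course, live at different sites.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Assumed base relation.
    cur.execute("CREATE TABLE customer (cus_num INTEGER PRIMARY KEY, cus_name TEXT, "
                "cus_state TEXT, cus_balance REAL)")
    cur.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                    [(10, 'Alan', 'TN', 120.0), (11, 'Betty', 'FL', 0.0),
                     (12, 'Hernando', 'FL', 250.0)])

    # Horizontal fragmentation: subsets of rows, same columns (SELECT ... WHERE).
    cur.execute("CREATE TABLE customer_fl AS SELECT * FROM customer WHERE cus_state = 'FL'")
    cur.execute("CREATE TABLE customer_tn AS SELECT * FROM customer WHERE cus_state = 'TN'")

    # Vertical fragmentation: subsets of columns, key repeated in every fragment (PROJECT).
    cur.execute("CREATE TABLE customer_names AS SELECT cus_num, cus_name FROM customer")
    cur.execute("CREATE TABLE customer_accounts AS SELECT cus_num, cus_balance FROM customer")

    print(cur.execute("SELECT * FROM customer_fl").fetchall())
    print(cur.execute("SELECT * FROM customer_names").fetchall())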
Data Replication
This refers to the storage of data copies at multiple sites served by computer network.
Fragment copies can be stored at several sites to serve specific information requirements
Can enhance data availability and response time
Can help to reduce communication and total query costs.
Replicated data are subject to the mutual consistency rule. The mutual consistency rule requires
that all copies of data fragments be identical. Therefore to maintain data consistency among the
replicas, the DDBMS must ensure that a DB update is performed at all sites where replicas exist.
Three Replication scenarios exist: a DB can be:
Fully replicated database - Stores multiple copies of each database fragment at multiple
sites. It can be impractical due to amount of overhead it imposes on the system.
Partially replicated database - Stores multiple copies of some database fragments at
multiple sites. Most DDBMSs are able to handle the partially replicated database well.
Unreplicated database: Stores each database fragment at single site. No duplicate database
fragments.
Several factors influence the decision to use data replication:
Database size
Usage frequency
Costs
Data Allocation
It describes the process of deciding where to locate data. Data allocation strategies are as follows:
With Centralized data allocation, the entire database is stored at one site
With Partitioned data allocation, the database is divided into several disjointed parts
(fragments) and stored at several sites.
With Replicated data allocation, Copies of one or more database fragments are stored at
several sites.
Data distribution over computer network is achieved through data partition, data replication, or
combination of both. Data allocation is closely related to the way a database is divided or
fragmented.
Data allocation algorithms take into consideration a variety of factors, including:
- Performance and data availability goals.
- Size, number of rows and number of relations that an entity maintains with other
entities.
- Types of transactions to be applied to the DB and the attributes accessed by each of
those transactions.
Client/Server vs. DDBMS
Client/server architecture refers to the way in which computers interact to form a system. The architecture features a user
of resources, or client, and a provider of resources, or server. The client/server architecture can be used
to implement a DBMS in which the client is the TP and the server is the DP.
Client/server advantages
Less expensive than alternate minicomputer or mainframe solutions
Allows the end user to use the microcomputer's GUI, thereby improving functionality and simplicity
More people in job market have PC skills than mainframe skills
PC is well established in workplace.
Numerous data analysis and query tools exist to facilitate interaction with DBMSs available
in PC market
There is a considerable cost advantage to offloading applications development from
mainframe to powerful PCs.
Client/server disadvantages
Creates more complex environment in which different platforms (LANs, operating system
etc) are often difficult to manage.
An increase in number of users and processing sites often paves the way for security
problems.
The C/S envt makes it possible to spread data access to a much wider circle of users. Such an
envt increases the demand for people with a broad knowledge of computers and software
applications. The burden of training increases the cost of maintaining the environment.
C. J. Date's Twelve Commandments for Distributed Databases
1. Local site independence. Each local site can act as an independent, autonomous, centralized
DBMS. Each site is responsible for security, concurrency control, backup and recovery.
2. Central site independence. No site in the network relies on a central site or any other site.
All sites have the same capabilities.
3. Failure independence. The system is not affected by node failures.
4. Location transparency. The user does not need to know the location of the data in order to
retrieve those data
5. Fragmentation transparency. The user sees only one logical DB. Data fragmentation is
transparent to the user. The user does not need to know the name of the DB fragments in
order to retrieve them.
6. Replication transparency. The user sees only one logical DB. The DDBMS transparently
selects the DB fragment to access.
7. Distributed query processing. A distributed query may be executed at several different DP
sites.
8. Distributed transaction processing. A transaction may update data at several different sites.
The transaction is transparently executed at several diff DP sites.
9. Hardware independence. The system must run on any hardware platform.
10. Operating system independence. The system must run on any OS software platform.
11. Network independence. The system must run on any network platform.
12. Database independence. The system must support any vendor's DB product.
Two-phase commit protocol
Two-phase commit is a standard protocol in distributed transactions for achieving ACID properties. Each
transaction has a coordinator who initiates and coordinates the transaction.
In the two-phase commit the coordinator sends a prepare message to all participants (nodes) and waits for
their votes. The coordinator then collects the votes and sends the global decision to every participant, and each
participant waits for this decision from the coordinator before committing or aborting the transaction. If all
participants vote to commit, the coordinator records the decision in a log and sends a commit message to all
participants. If for any reason a participant votes to abort, the coordinator sends a rollback message and the
transaction is undone using the log created earlier. The advantage of this is that all participants reach a
consistent decision.
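The exchange just described can be sketched as follows. This is a toy, single-process illustration of the coordinator's decision logic only (the Participant class and its prepare/commit/rollback methods are assumptions); a real implementation would use network messages, timeouts and a persistent log.

    class Participant:
        """Toy participant; a real one would be a remote DP site reached by messages."""
        def __init__(self, name, will_commit=True):
            self.name, self.will_commit = name, will_commit
        def prepare(self):            # phase 1: vote in response to the prepare message
            return self.will_commit
        def commit(self):
            print(f"{self.name}: committed")
        def rollback(self):
            print(f"{self.name}: rolled back")

    def two_phase_commit(participants, log):
        # Phase 1: send prepare to every participant and collect the votes.
        votes = [p.prepare() for p in participants]
        # Phase 2: record the global decision in the log, then broadcast it.
        if all(votes):
            log.append("COMMIT")
            for p in participants:
                p.commit()
        else:
            log.append("ABORT")
            for p in participants:
                p.rollback()
        return log[-1]

    log = []
    print(two_phase_commit([Participant("site A"), Participant("site B", will_commit=False)], log))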
However, the two-phase commit protocol also has limitations in that it is a blocking protocol. For example,
participants will block resources while waiting for a message from the coordinator. If for any
reason this message never arrives, the participant will continue to wait and may never resolve its transaction,
so the resource could be blocked indefinitely. Similarly, the coordinator will block resources while
waiting for replies from participants, and it too can block indefinitely if no
acknowledgement is received from a participant. Despite these limitations, most systems still use the
two-phase commit protocol.
Three-phase commit protocol
An alternative to the two-phase commit protocol used by many database systems is the three-phase commit.
Dale Skeen describes the three-phase commit as a non-blocking protocol. He then goes on to say that it was
developed to avoid the failures that occur in two-phase commit transactions.
As with the two-phase commit, the three-phase also has a coordinator who initiates and coordinates the
transaction. However, the three-phase protocol introduces a third phase called the pre-commit. The aim of
this is to remove the uncertainty period for participants that have committed and are waiting for the global
abort or commit message from the coordinator. When receiving a pre-commit message, participants know
that all others have voted to commit. If a pre-commit message has not been received the participant will
abort and release any blocked resources.
Review Questions
1. Describe the evolution from centralized DBMS to distributed DBMS.
2. List and discuss some of the factors that influenced the evolution of the DBMS.
3. What are the advantages and disadvantages of a DDBMS?
4. Explain the difference between a distributed DB and distributed processing.
5. What is a fully distributed DB mgt system?
6. What are the components of a DDBMS?
7. Explain the transparency features of a DDBMS.
8. Define and explain the different types of distribution transparency.
9. Explain the need for the two-phase commit protocol. Then describe the two phases.
10. What is the objective of the query optimisation function?
11. To which transparency feature are the query optimisation functions related?
12. What are different types of query optimisation algorithms?
13. Describe the three data fragmentation strategies. Give some examples.
14. What is data replication and what are the three replication strategies?
15. Explain the difference between file server and client/server architecture.
Data Warehouse
The Need for Data Analysis
G Managers must be able to track daily transactions to evaluate how the business is
performing.
G By tapping into the operational database, management can develop strategies to meet
organizational goals.
G Data analysis can provide information about short-term tactical evaluations and strategies.
Given the many and varied competitive pressures, managers are always looking for a competitive
advantage through product development and maintenance, service, market positioning, sales
promotion. In addition, the modern business climate requires managers to approach increasingly
complex problems that involve a rapidly growing number of internal and external variables.
Different managerial levels have different decision support needs. Managers require detailed
information designed to help them make decisions in a complex data and analysis environment. To
support such decision making, information systems (IS) depts. have created decision support systems,
or DSSs.
Decision support System
Decision support is a methodology (or series of methodologies) designed to extract
information from data and to use such information as a basis for decision making.
Decision support system (DSS)
Arrangement of computerized tools used to assist managerial decision making within
business
Usually requires extensive data massaging to produce information
Used at all levels within organization
Often tailored to focus on specific business areas or problems such as finance,
insurance, banking and sales.
The DSS is interactive and provides ad hoc query tools to retrieve data and to display
data in different formats.
Keep in mind that managers must initiate decision support process by asking the appropriate
questions. The DSS exists to support the manager; it does not replace the mgt function.
A DSS is composed of the following four main components:
Data store component: Basically a DSS database. The data store contains two main
types of data: business data and business model data. The business data are extracted from
the operational DB and from external data sources. The external data sources provide data
that cannot be found within the company. The business models are generated by special
algorithms that model the business to identify and enhance the understanding of business
situations and problems.
Data extraction and data filtering component: Used to extract and validate data taken
from the operational database and external data sources. For example, to determine the relative
market share by selected product line, the DSS requires data about competitors' products.
Such data can be located in external DBs provided by the industry groups or by companies
that market the data. This component extracts the data, filters the extracted data to select the
relevant records, and packages the data in the right format to be added to the DSS data store
component.
End-user query tool: Used by the data analyst to create queries that access the database.
Depending on the DSS implementation, the query tool accesses either the operational DB or,
more commonly, the DSS DB. This tool advises the user on which data to select and how to
build a reliable business data model.
End-user presentation tool: Used to organize and present data. This also helps the end user
select the most appropriate presentation format such as summary report or mixed graphs.
Although the DSS is used at strategic and tactical managerial levels within organization, its
effectiveness depends on the quality of data gathered at the operational level.
Operational Data vs. Decision Support Data
Operational Data
Mostly stored in relational database in which the structures (tables) tend to be highly
normalized.
Operational data storage is optimized to support transactions representing daily
operations.
DSS Data
Give tactical and strategic business meaning to operational data.
Differs from operational data in following three main areas:
G Timespan: operational data covers a short time frame.
G Granularity (level of aggregation) DSS data must be presented at different
levels of aggregation from highly summarized to near atomic.
G Dimensionality: operational data focus on representing individual
transactions rather than on the effects of the transactions over time.
Difference Between Operational and DSS Data
Characteristic | Operational Data | DSS Data
Data currency | Current operations; real-time data | Historic data; snapshots of company data; time component (week/month/year)
Granularity | Atomic, detailed data | Summarized data
Summarization level | Low; some aggregate yields | High; many aggregation levels
Data model | Highly normalized; mostly relational DBMS | Non-normalised; complex structures; some relational, but mostly multidimensional DBMS
Transaction type | Mostly updates | Mostly queries
Transaction volumes | High update volumes | Periodic loads and summary calculations
Transaction speed | Updates are critical | Retrievals are critical
Query activity | Low to medium | High
Query scope | Narrow range | Broad range
Query complexity | Simple to medium | Very complex
Data volumes | Hundreds of megabytes, up to gigabytes | Hundreds of gigabytes, up to terabytes
DSS Database Requirements: A DSS DB is specialized DBMS tailored to provide fast answers to
complex queries.
Four main requirements:
Database schema
Data extraction and loading
End-user analytical interface
Database size
Database schema
Must support complex data representations
Must contain aggregated and summarized data
Queries must be able to extract multidimensional time slices
Data extraction
Should allow batch and scheduled data extraction
Should support different data sources
Flat files
Hierarchical, network, and relational databases
Multiple vendors
Data filtering: Must allow checking for inconsistent data or data validation rules.
End-user analytical interface
The DSS DBMS must support advanced data modeling and data presentation tools. Using those
tools makes it easy for data analysts to define the nature and extent of business problems.
The end-user analytical interface is one of the most critical DSS DBMS components. When properly
implemented, an analytical interface permits the user to navigate through the data to simplify and
accelerate the decision-making process.
Database size
In 2005, Wal-Mart had 260 terabytes of data in its data warehouses.
DSS DB typically contains redundant and duplicated data to improve retrieval and simplify
information generation. Therefore, the DBMS must support very large databases (VLDBs)
The Data Warehouse
The acknowledged father of the data warehouse defines the term as an "integrated, subject-oriented,
time-variant, nonvolatile collection of data that provides support for decision making."
Integrated: the data warehouse is a centralized, consolidated DB that integrates data derived from
the entire organization and from multiple sources with diverse formats. Data integration implies
that all business entities, data elements, data characteristics and business metrics are described in
the same way throughout the enterprise.
Subject-oriented: data warehouse data are arranged and optimized to provide answers to questions
coming from diverse functional areas within a company. Data warehouse data are organized and
summarized by topic such as sales, marketing.
Time-variant: warehouse data represent the flow of data through time. The data warehouse can
even contain projected data generated through statistical and other models. It is also time-variant
in the sense that once data are periodically uploaded to the data warehouse, all time-dependent
aggregations are recomputed.
Non-volatile: once data enter the data warehouse, they are never removed. Because the data in the
warehouse represent the company's history, the operational data, representing the near-term
history, are always added to it. Data are never deleted and new data are continually added, so the
data warehouse is always growing.
Comparison of Data Warehouse and Operational database characteristics
Characteristic | Operational Database Data | Data Warehouse Data
Integrated | Similar data can have different representations or meanings. | Provides a unified view of all data elements with a common definition and representation for all business units.
Subject-oriented | Data are stored with a function or process orientation. For example, data may be stored for invoices, payments and credit amounts. | Data are stored with a subject orientation that facilitates multiple views of the data and facilitates decision making. For example, sales may be recorded by product, by division, by manager or by region.
Time-variant | Data are recorded as current transactions. For example, the sales data may be the sale of a product on a given date. | Data are recorded with a historical perspective in mind. Therefore, a time dimension is added to facilitate data analysis and various time comparisons.
Non-volatile | Data updates are frequent and common. For example, an inventory amount changes with each sale. Therefore the data environment is fluid. | Data cannot be changed. Data are added only periodically from historical systems. Once the data are properly stored, no changes are allowed. Therefore the data environment is relatively static.
In summary, a data warehouse is usually a read-only database optimized for data analysis and query
processing. Creating a data warehouse requires time, money, and considerable managerial effort.
Data Warehouse properties
The warehouse is organized around the major subjects of an enterprise (e.g. customers,
products, and sales) rather than the major application areas (e.g. customer invoicing, stock
control, and order processing). Subject Oriented
The data warehouse integrates corporate application-oriented data from different source
systems, which often include data that are inconsistent. Such data must be made consistent
to present a unified view of the data to the users. Integrated
Data in the warehouse is only accurate and valid at some point in time or over some time
interval. Time-variance is also shown in the extended time that the data is held, the
association of time with all data, and the fact that data represents a series of historical
snapshots. Time Variant
Data in the warehouse is not updated in real-time but is refreshed from operational systems
on a regular basis. New data is always added as a supplement to the database, rather than a
replacement. Non-volatile
A data mart is a small, single-subject data warehouse subset that provides decision support to a
small group of people. Some organizations choose to implement data marts not only because of the
lower cost and shorter implementation time, but also because of current technological advances
and inevitable people issues that make data marts attractive.
Data marts can serve as a test vehicle for companies exploring the potential benefits of data
warehouses. By migrating gradually from data marts to data warehouses, a specific dept's decision
support needs can be addressed within a reasonable time frame, as compared to the longer time
frame usually required to implement a data warehouse.
The difference between a data mart and a data warehouse is only the size and scope of the problem
being solved.
Twelve Rules That Define a Data Warehouse
1. Data warehouse and operational environments are separated.
2. Data warehouse data are integrated.
3. Data warehouse contains historical data over long time horizon.
4. Data warehouse data are snapshot data captured at given point in time.
5. Data warehouse data are subject oriented.
6. Data warehouse data are mainly read-only with periodic batch updates from operational
data. No online updates allowed
7. Data warehouse development life cycle differs from classical systems development. The
data warehouse development is data-driven; the classical approach is process-driven.
8. Data warehouse contains data with several levels of detail: current detail data, old detail
data, lightly summarized data, and highly summarized data.
9. Data warehouse environment is characterized by read-only transactions to very large
data sets. The operational envt is characterized by numerous update transactions to a few
data entities at a time.
10. Data warehouse environment has system that traces data sources, transformations, and
storage.
11. The data warehouse's metadata are a critical component of this environment. The metadata
identify and define all data elements.
12. Data warehouse contains chargeback mechanism for resource usage that enforces
optimal use of data by end users.
Online Analytical Processing (OLAP) creates an advanced data analysis environment that
supports decision making, business modeling, and operations research.
OLAP systems share four main characteristics:
Use multidimensional data analysis techniques
Provide advanced database support
Provide easy-to-use end-user interfaces
Support client/server architecture
Multidimensional Data Analysis Techniques
The most distinctive characteristic of modern OLAP tools is their capacity for multidimensional analysis. In
multidimensional analysis:
- Data are processed and viewed as part of a multidimensional structure.
- This type of data analysis is particularly attractive to business decision makers because they
tend to view business data as data that are related to other business data.
- Multidimensional data analysis techniques are augmented by the following functions:
  - Advanced data presentation functions: 3-D graphics, pivot tables, and crosstabs.
  - Advanced data aggregation, consolidation, and classification functions that allow the
data analyst to create multiple data aggregation levels and to slice and dice data.
  - Advanced computational functions: business-oriented variables, financial and
accounting ratios.
  - Advanced data modeling functions: support for what-if scenarios, variable
assessment, variable contributions to outcome, linear programming, and other
modeling tools.
Advanced Database Support
To deliver efficient decision support, OLAP tools must have advanced data access features, which include:
Access to many different kinds of DBMSs, flat files, and internal and external data
sources.
Access to aggregated data warehouse data as well as to detail data found in
operational databases.
Advanced data navigation features such as drill-down and roll-up.
Rapid and consistent query response times
Ability to map end-user requests to appropriate data source and then to proper data
access language (usually SQL)
Support for very large databases
Easy-to-Use End-User Interface: Many of the interface features are borrowed from previous
generations of data analysis tools that are already familiar to end users. This familiarity makes
OLAP easily accepted and readily used.
Client Server Architecture
This provides a framework within which new systems can be designed, developed, and
implemented. The client/server environment:
- Enables the OLAP system to be divided into several components that define its architecture.
- Allows OLAP to meet ease-of-use as well as system flexibility requirements.
OLAP ARCHITECTURE: OLAP operational characteristics can be divided into three main
modules:
Graphical user interface (GUI)
Analytical processing logic.
Data-processing logic.
Designed to use both operational and data warehouse data
Defined as an advanced data analysis environment that supports decision making, business
modeling, and operations research activities
In most implementations, data warehouse and OLAP are interrelated and complementary
environments
RELATIONAL OLAP: Provides OLAP functionality by using relational databases and familiar
relational query tools to store and analyze multidimensional data
It adds the following extensions to the traditional RDBMS:
Multidimensional data schema support within RDBMS
Data access language and query performance optimized for multidimensional data
Support for very large databases (VLDBs).
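To make the ROLAP idea concrete, the sketch below shows the kind of star-schema aggregation query a ROLAP front end might generate. The fact and dimension table names (sales, time_dim, product_dim) are invented for this illustration, and GROUP BY ROLLUP is the standard SQL construct that many relational engines provide for producing several aggregation levels (detail rows, subtotals, and a grand total) in a single query, which is the sort of multidimensional support listed above.

    // A star-schema aggregation query of the kind a ROLAP layer generates.
    // Table and column names (sales, time_dim, product_dim, ...) are hypothetical.
    public class RolapRollupQuery {

        // Joins the fact table to two dimension tables and rolls the measure up
        // through three aggregation levels:
        //   (year, product line) detail rows,
        //   (year) subtotals with product_line reported as NULL,
        //   one grand-total row with both grouping columns NULL.
        static final String ROLLUP_QUERY = """
            SELECT t.sales_year, p.product_line, SUM(s.sales_amount) AS total_sales
            FROM   sales s
            JOIN   time_dim    t ON s.time_id    = t.time_id
            JOIN   product_dim p ON s.product_id = p.product_id
            GROUP BY ROLLUP (t.sales_year, p.product_line)
            """;

        public static void main(String[] args) {
            // In practice the string would be sent to the database through a
            // connectivity layer such as ODBC or JDBC (covered later in these notes).
            System.out.println(ROLLUP_QUERY);
        }
    }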
Relational vs. Multidimensional OLAP
Schema: ROLAP uses a star schema, and additional dimensions can be added dynamically. MOLAP
uses data cubes, and additional dimensions require re-creation of the data cube.
Database size: ROLAP medium to large; MOLAP small to medium.
Architecture: ROLAP is client/server, standards-based, and open; MOLAP is client/server and proprietary.
Access: ROLAP supports ad hoc requests and unlimited dimensions; MOLAP is limited to predefined
dimensions.
Resources: ROLAP high; MOLAP very high.
Flexibility: ROLAP high; MOLAP low.
Scalability: ROLAP high; MOLAP low.
Speed: ROLAP is good with small data sets and average for medium to large data sets; MOLAP is
faster for small to medium data sets and average for large data sets.
Review Questions
1. What are decision support systems and what role do they play in the business envt.?
2. Explain how the main components of a DSS interact to form a system?
3. What are the most relevant differences between operational and decision support data?
4. What is a data warehouse and what are its main characteristics?
5. Give three examples of problems likely to be encountered when operational data are
integrated into the data warehouse.
While working as a DB analyst for a national sales organization, you are asked to be part of its
data warehouse project team.
6. Prepare a high level summary of the main requirements for evaluating DBMS products for
data warehousing.
8. Suppose you are selling the data warehouse idea to your users. How would you define
multidimensional data analysis for them? How would you explain its advantages to them?
9. Before making a commitment, the data warehousing project group has invited you to provide
an OLAP overview. The group's members are particularly concerned about the OLAP
client/server architecture requirements and how OLAP will fit the existing environment. Your
job is to explain to them the main OLAP client/server components and architectures.
11. The project group is ready to make a final decision, choosing between ROLAP and
MOLAP. What should be the basis for this decision? Why?
14. What is OLAP and what are its main characteristics?
15. Explain ROLAP and give the reasons you would recommend its use in the relational DB
envt.
20. Explain some of the most important issues in data warehouse implementation.
Web DBMS
Database System: An Introduction to OODBMS and Web DBMS
PROBLEMS WITH RDBMSs
Poor representation of real world entities.
Semantic overloading.
Poor support for integrity and business constraints.
Homogeneous data structure.
Limited operations.
Difficulty handling recursive queries.
Impedance mismatch.
Difficulty with Long Transactions.
Object Oriented Database Management Systems (OODBMSs): These are an attempt at
marrying the power of Object Oriented Programming Languages with the persistence and
associated technologies of a DBMS.
Object Oriented Database Management System
An OODBMS combines features drawn from OOPLs with features drawn from DBMSs:
From OOPLs: complex objects, object identity, methods and messages, inheritance, polymorphism,
extensibility, computational completeness.
From DBMSs: persistence, disc management, data sharing, reliability, security, ad hoc querying.
THE OO DATABASE MANIFESTO
CHARACTERISTICS THAT MUST BE SUPPORTED
Complex objects
Object Identity
Encapsulation
Classes
Inheritance
Overriding and late-binding
Extensibility
Computational completeness
Persistence
Concurrency
Recovery
Ad-hoc querying
Requirements and Features
Requirements:
Transparently add persistence to OO programming languages
Ability to handle complex data, e.g. multimedia data
Ability to handle data complexity, i.e. interrelated data items
Add DBMS Features to OO programming languages.
Features:
The host programming language is also the DML.
The in-memory and storage models are merged.
No conversion code between models and languages is needed.
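As a sketch of what these features mean in practice, the fragment below stores and retrieves ordinary objects of the host language directly. The ObjectDatabase interface and the InMemoryDatabase class are invented purely for this illustration and do not represent any particular OODBMS product's API; the point is that the Java class itself acts as the schema, the host language is the DML, and the programmer writes no conversion code between the in-memory model and the storage model.

    // Sketch of OODBMS-style persistence: the host language is also the DML.
    // ObjectDatabase and InMemoryDatabase are invented for this illustration only.
    import java.util.ArrayList;
    import java.util.List;

    class Customer {
        String name;
        String city;
        Customer(String name, String city) { this.name = name; this.city = city; }
    }

    interface ObjectDatabase {
        void store(Object obj);                  // make an in-memory object persistent as-is
        <T> List<T> extent(Class<T> type);       // all stored instances of a class
    }

    // Stand-in implementation so the sketch runs; a real OODBMS would write the
    // objects to disc and bring them back unchanged in later sessions.
    class InMemoryDatabase implements ObjectDatabase {
        private final List<Object> objects = new ArrayList<>();
        public void store(Object obj) { objects.add(obj); }
        public <T> List<T> extent(Class<T> type) {
            List<T> result = new ArrayList<>();
            for (Object o : objects) if (type.isInstance(o)) result.add(type.cast(o));
            return result;
        }
    }

    public class OodbSketch {
        public static void main(String[] args) {
            ObjectDatabase db = new InMemoryDatabase();
            db.store(new Customer("ROBCOR Company", "London"));   // no INSERT, no mapping code
            for (Customer c : db.extent(Customer.class)) {
                System.out.println(c.name + " - " + c.city);      // objects come back as objects
            }
        }
    }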
Data Storage for Web Site
File based systems:
information in separate HTML files
file management problems
information update problems
static web pages, non-interactive
Database based systems:
database accessed from the web
dynamic information handling
data management and data updating through the DBMS
Interconnected networks
TCP/IP (Transmission Control Protocol/ Internet Protocol)
HTTP (HyperText Transfer Protocol)
Internet Database
Web database connectivity allows new innovative services that:
Permit rapid response to competitive pressures by bringing new services and products to
market quickly.
Increase customer satisfaction through the creation of Web-based support services.
Yield fast and effective information dissemination through universal access, whether from across
the street or across the globe.
Characteristics and Benefits of Internet Technologies
Internet Characteristic and Benefits:
- Hardware and software independence: savings in equipment/software acquisition; ability to run
on most existing equipment; platform independence and portability; no need for multiple-platform
development.
- Common and simple user interface: reduced training time and cost; reduced end-user support
cost; no need for multiple-platform development.
- Location independence: global access through the internet infrastructure; reduced requirements
(and costs) for dedicated connections.
- Rapid development at manageable costs: availability of multiple development tools;
plug-and-play development tools (open standards); more interactive development; reduced
development times; relatively inexpensive tools; free client access tools (Web browsers); low entry
costs and frequent availability of free Web servers; reduced costs of maintaining private networks;
distributed processing and scalability using multiple servers.
Web-to-Database Middleware: Server-Side Extensions
A server-side extension is a program that interacts directly with the web server to handle specific
types of requests. It also makes it possible to retrieve and present query results, but what is more
important is that it provides its services to the web server in a way that is totally transparent to the
client browser. In short, the server-side extension adds significant functionality to the web server
and therefore to the internet.
A database server-side extension program is also known as Web-to-database middleware.
1. The client browser sends a page request to the Web server.
2. The Web server receives and validates the request.
3. The Web-to-database middleware reads, validates, and executes the script. In this
case, it connects to the database and passes the query using the database connectivity layer.
4. The database server executes the query and passes the result back to the Web-to-database
middleware.
5. The Web-to-database middleware compiles the result set, dynamically generates an
HTML-formatted page that includes the data retrieved from the database, and sends it to the
Web server.
6. The Web server returns the just-created HTML page, which now includes the query
result, to the client browser.
7. The client browser displays the page on the local computer.
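The sketch below shows, in miniature, what steps 3 to 5 look like in code. It uses the JDK's built-in com.sun.net.httpserver.HttpServer to stand in for the Web server plus its server-side extension; the JDBC connection URL, credentials, and the customer table are hypothetical, and a real middleware product would add script interpretation, connection pooling, security, and error handling.

    // Minimal Web-to-database middleware sketch (assumes a reachable JDBC data
    // source at the hypothetical URL below, with a customer table).
    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.sql.*;

    public class WebToDbMiddleware {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/customers", exchange -> {
                StringBuilder html = new StringBuilder("<html><body><h1>Customers</h1><ul>");
                // Steps 3-4: connect to the database and run the query.
                try (Connection con = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/salesdb", "web_user", "secret");
                     Statement stmt = con.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT cust_name FROM customer")) {
                    while (rs.next()) {
                        html.append("<li>").append(rs.getString("cust_name")).append("</li>");
                    }
                } catch (SQLException e) {
                    html.append("<li>Query failed: ").append(e.getMessage()).append("</li>");
                }
                html.append("</ul></body></html>");
                // Steps 5-6: send the dynamically generated HTML page back to the browser.
                byte[] body = html.toString().getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();  // Steps 1-2 occur when a browser requests http://localhost:8080/customers
        }
    }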
The interaction between the Web server and the Web-to-database middleware is crucial to the
development of a successful internet database implementation. Therefore, the middleware must be
well integrated with the other internet services and the components that are involved in its use.
Web Server Interfaces:
A Web server interface defines how a Web server communicates with external programs.
There are two well-defined Web server interfaces:
Common Gateway Interface (CGI)
Application programming interface (API)
The Common Gateway Interface uses script files that perform specific functions based on the
client's parameters that are passed to the Web server. The script file is a small program containing
commands written in a programming language. The script file's contents can be used to connect to
the database and to retrieve data from it, using the parameters passed by the Web server.
A script is a series of instructions executed in interpreter mode. The script is a plain text file that is
not compiled like COBOL, C++, or Java. Scripts are normally used in Web application
development environments.
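CGI itself is language-neutral: the Web server places the request data in environment variables such as QUERY_STRING, runs the program, and relays whatever the program writes to standard output, which must be a block of HTTP headers, a blank line, and then the page body. The sketch below shows that contract in Java only for consistency with the other examples in these notes; as noted later, CGI scripts are more commonly written in languages such as Perl or VBScript.

    // Minimal CGI-style program: the web server passes request data in
    // environment variables and forwards whatever is written to stdout.
    public class HelloCgi {
        public static void main(String[] args) {
            String query = System.getenv("QUERY_STRING");   // e.g. "name=Smith"
            // A CGI response is: header lines, a blank line, then the body.
            System.out.println("Content-Type: text/html");
            System.out.println();
            System.out.println("<html><body>");
            System.out.println("<p>Parameters received: " + (query == null ? "(none)" : query) + "</p>");
            System.out.println("</body></html>");
        }
    }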
An application programming interface (API) is a newer Web server interface standard that is more
efficient and faster than a CGI script. APIs are more efficient because they are implemented as
shared code or as dynamic-link libraries (DLLs). APIs are faster than CGI because the code resides in
memory and there is no need to run an external program for each request. However, because APIs share
the same memory space as the Web server, an API error can bring down the server. The other disadvantage is
that APIs are specific to the Web server and to the operating system.
The Web Browser
This is the application software, e.g. Microsoft Internet Explorer or Mozilla Firefox, that lets users
navigate (browse) the Web. Each time the end user clicks a hyperlink, the browser generates an HTTP
GET page request that is sent to the designated Web server using the TCP/IP internet protocol. The
Web browser's job is to interpret the HTML code that it receives from the Web server and to
present the different page components in a standard formatted way.
The Web as a Stateless System: A stateless system is one in which, at any given time, the Web server does
not know the status of any of the clients communicating with it. Client and server computers interact in
very short conversations that follow the request-reply model.
XML Presentation
Extensible Markup Language (XML) is a metalanguage used to represent and manipulate data
elements. XML is designed to facilitate the exchange of structured documents, such as orders and
invoices, over the internet. XML provides the semantics that facilitate the exchange, sharing, and
manipulation of structured documents across organizational boundaries.
One of the main benefits of XML is that it separates data structure from its presentation and
processing.
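As a simple illustration, a purchase order exchanged between two companies might be marked up as shown below. The element and attribute names (and the values) are invented for this example; the important point is that the document carries structure and meaning only, so the same data could be presented for print, the Web, or a mobile device by applying different stylesheets.

    <?xml version="1.0" encoding="UTF-8"?>
    <order orderNo="10234" orderDate="2013-05-02">
        <customer custId="C-881">ROBCOR Company</customer>
        <line>
            <product code="P-1702">Laser printer toner</product>
            <quantity>4</quantity>
            <unitPrice currency="GBP">39.50</unitPrice>
        </line>
    </order>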
Data Storage for Web Sites
HTTP provides multiple transactions between clients and servers
Based on Request-Response paradigm
- Connection (from client to web server)
- Request (message to web server)
- Response (information required as an HTML file)
- Close (connection to web server closed)
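A minimal sketch of one such request-response cycle, using the JDK's java.net.http client against a placeholder URL: the client connects, sends a single GET request, receives the response, and the exchange is then finished, which is exactly why the Web is described as a stateless system above.

    // One HTTP request-response exchange (the URL is a placeholder).
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpRoundTrip {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();                      // connection
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://www.example.com/products.html"))          // request
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());  // response
            System.out.println("Status: " + response.statusCode());
            System.out.println(response.body());                                 // HTML returned by the server
        }   // connection closed; the server keeps no memory of this client
    }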
Web is used as an interface platform that provides access to one or more databases
Question of database connectivity
Open architecture approach to allow interoperability:
- COM/DCOM ((Distributed) Component Object Model, Microsoft)
- CORBA (Common Object Request Broker Architecture)
- Java/RMI (Remote Method Invocation)
DBMS Architecture
Integration of web with database application
Vendor/ product independent connectivity
Interface independent of proprietary Web browsers
Capability to use all the features of the DBMS
Access to corporate data in a secure manner
Two-tier client-server architecture:
- User interface/transaction logic
- Database application/data storage
Three-tier client-server architecture maps suitably to the Web environment
- First tier: Web browser, thin client
- Second tier: Application server, Web server
- Third tier: DBMS server, DBMS
DBMS - Web Architecture
Three-tier client-server architecture:
- User interface
- Transaction/application logic
- DBMS
N-tier client-server architecture (Internet Computing Model):
- Web browser (thin client)
- Web server
- Application server
- DBMS server, DBMS
Integrating the Web and DBMS
Integration between Web Server and Application Server
Web requests received by the Web Server invoke transactions on the Application Server
CGI (Common Gateway Interface)
Non-CGI Gateways.
CGI (Common Gateway Interface)
- transfers information between a Web Server and a CGI Program
- CGI programs (scripts) run on either Web Server or Application Server
- scripts can be written in VBScript or Perl
- CGI is web server independent and scripting language independent.
Non-CGI Gateways
- Proprietary interfaces, specific to a vendors web server
- Netscape API (Sun Microsystems)
- ASP (Active Server Pages) (Microsoft Internet Information Server)
Integration between Application Server and DBMS
- applications on the Application Server connect to and interact with the database
- connections between Application Server and the Databases provided by API
(Application Programming Interface)
Standard APIs:
ODBC (Open Database Connectivity) connects application programs to DBMS.
JDBC (Java Database Connectivity) connects Java applications to DBMS
ODBC (Open Database Connectivity)
- standard API, common interface for accessing SQL databases
- DBMS vendor provides a set of library functions for database access
- Functions are invoked by application software
- execute SQL statements to return rows of data as a result of data search
- de facto industry standard
ODBC is Microsoft's implementation of a superset of the SQL Access Group Call Level Interface
(CLI) standard for database access. ODBC is probably the most widely supported database connectivity
interface. ODBC allows any Windows application to access relational data sources using SQL via a
standard application programming interface (API). Microsoft also developed two other data access
interfaces: Data Access Objects (DAO) and Remote Data Objects (RDO).
DAO is an object-oriented API used to access MS Access, MS FoxPro, and dBase databases (using
the Jet data engine) from Visual Basic programs.
RDO is a higher-level object-oriented application interface used to access remote database servers.
The Basic ODBC architecture has three main Components:
A higher-level ODBC API through which application programs access ODBC functionality.
A driver manager that is in charge of managing all DB connections.
An ODBC driver that communicates directly to the DBMS
Defining a data source is the first step in using ODBC. To define a data source, you must create a
data source name (DSN) for the data source. To create a DSN you need to provide:
An ODBC driver
A DSN Name
ODBC Driver Parameters
JDBC (Java Database Connectivity)
- modelled after ODBC, a standard API
- provides access to DBMSs from Java application programs
- machine independent architecture
- direct mapping of RDBMS tables to Java classes
- SQL statements used as string variables in Java methods (embedded SQL).
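A minimal JDBC sketch is shown below. It assumes a suitable JDBC driver is on the classpath; the connection URL, credentials, and the vendor table are all hypothetical stand-ins. The pattern it shows (obtain a Connection from DriverManager, embed the SQL statement as a string in a Java method, bind parameters, and iterate over the ResultSet) is the standard one.

    // Minimal JDBC sketch; URL, credentials, and the vendor table are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JdbcExample {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://localhost/robcor";   // vendor-specific JDBC URL
            String sql = "SELECT vend_code, vend_name FROM vendor WHERE vend_state = ?";
            try (Connection con = DriverManager.getConnection(url, "app_user", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "KY");                            // bind the parameter
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {                           // iterate over returned rows
                        System.out.println(rs.getString("vend_code") + " - "
                                + rs.getString("vend_name"));
                    }
                }
            }
        }
    }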
Review Questions
1. What is the difference between DAO and RDO?
2. What are the three basic components of the ODBC architecture?
3. What steps are required to create an ODBC data source name?
4. What are Web server interfaces used for? Give some examples.
5. What does this statement mean: The web is a stateless system? What implications does a
stateless system have for DB applications developers?
6. What is a web application server, and how does it work from a database perspective?
7. What are scripts, and what is their function? ( Think in terms of DB applications
development.)
8. What is XML and why is it important?