Database Design Concepts Introduction Notes
INTRODUCTION
What is a database?
A database is a collection of related data.
Data is known facts that can be recorded and that have implicit meaning.
A database has the following implicit properties:
A database represents some aspect of the real world, sometimes called the miniworld or
the universe of discourse (UoD). Changes to the miniworld are reflected in the database.
A database is a logically coherent collection of data with some inherent meaning. A
random assortment of data cannot correctly be referred to as a database.
A database is designed, built, and populated with data for a specific purpose. It has an
intended group of users and some preconceived applications in which these users are
interested.
A database management system (DBMS) is a collection of programs that enables users to create
and maintain a database. The DBMS is hence a general-purpose software system that facilitates
the processes of defining, constructing, manipulating, and sharing databases among various users
and applications.
Defining a database involves specifying the data types, structures, and constraints for the data to
be stored in the database.
Constructing the database is the process of storing the data itself on some storage medium that is
controlled by the DBMS.
Manipulating a database includes such functions as querying the database to retrieve specific
data, updating the database to reflect changes in the miniworld, and generating reports from the
data.
Sharing a database allows multiple users and programs to access the database concurrently.
Protection includes both system protection against hardware or software malfunction (or
crashes), and security protection against unauthorized or malicious access.
A database system is the database and DBMS software together.
File systems
File processing systems were an early attempt to computerize the manual filing system that we are
all familiar with. A file system is a method for storing and organizing computer files and the data
they contain to make it easy to find and access them. File systems may use a storage device such
as a hard disk or CD-ROM and involve maintaining the physical location of the files.
In our own home, we probably have some sort of filing system, which contains receipts,
guarantees, invoices, bank statements, and such like. When we need to look something up, we go
to the filing system and search through the system starting from the first entry until we find what
we want. Alternatively, we may have an indexing system that helps to locate what we want more
quickly. For example, we may have divisions in the filing system or separate folders for different
types of items that are in some way logically related.
The manual filing system works well when the number of items to be stored is small. It even
works quite adequately when there are large numbers of items and we have only to store and
retrieve them. However, the manual filing system breaks down when we have to cross-reference
or process the information in the files. For example, a typical real estate agent's office might have
a separate file for each property for sale or rent, each potential buyer and renter, and each
member of staff.
Clearly the manual system is inadequate for this type of work. The file-based system was
developed in response to the needs of industry for more efficient data access. In early processing
systems, an organization's information was stored as groups of records in separate files.
In the traditional approach, information was stored in flat files, which are maintained by the
file system under the operating system's control. Here, flat files are files containing records
that have no structured relationship among them. The file handling taught in C/C++ is
an example of a file processing system. Application programs written in programming
languages such as C/C++ go through the file system to access these flat files, as shown.
Each file contained and processed information for one specific function, such as accounting or
inventory.
Files are designed by using programs written in programming languages such as COBOL, C, and
C++.
The physical implementation and access procedures are written into the database application;
therefore, physical changes result in intensive rework on the part of the programmer.
As systems became more complex, file processing systems offered little flexibility, presented
many limitations, and were difficult to maintain.
In other words, in the file-based approach, application programs are data dependent. This means
that a change in the physical representation of the data (how it is physically represented on disk)
or in the access technique (how it is physically accessed) also affects the application programs,
which then need modification. Application programs are dependent on how the data is physically
stored and accessed.
For example, if the physical format of the master/transaction file is changed by modifying the
delimiter of a field or record, the application programs that depend on that file must also be
modified.
Let us consider a student file, where information about students is stored in a text file and each
field is separated by a blank space, as shown below:
1 Rahat 35 Thapar
Now, if the delimiter of the field changes from blank space to semicolon as shown below:
1; Rahat; 35; Thapar
Then the application programs using this file must be modified, because they must now tokenize
the fields on a semicolon, whereas earlier the delimiter was a blank space.
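The dependence can be illustrated with a minimal Python sketch. The function parse_student and the field names (roll_no, name, age, institute) are hypothetical and chosen only for illustration; the notes do not say what each field means. The two record lines are the ones shown above.

```python
def parse_student(line, delimiter=" "):
    """Split one student record into named fields using a fixed delimiter."""
    roll_no, name, age, institute = [f.strip() for f in line.split(delimiter)]
    return {"roll_no": int(roll_no), "name": name, "age": int(age), "institute": institute}

old_record = "1 Rahat 35 Thapar"
new_record = "1; Rahat; 35; Thapar"

# The program written for the blank-separated format reads the old file.
student = parse_student(old_record)

# The same program fails on the new format: splitting on blanks leaves the
# semicolons attached to the fields, so int("1;") raises ValueError.
try:
    parse_student(new_record)
    old_program_still_works = True
except ValueError:
    old_program_still_works = False

# Only after the program itself is changed can it read the new file.
student_after_fix = parse_student(new_record, delimiter=";")
```

Nothing about the student changed between the two files, yet the program had to be edited. That is the data dependence described above.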
4. Difficulty in representing data from the user's view: To create useful applications for the
user, often data from various files must be combined. In file processing it was difficult to
determine relationships between isolated data in order to meet user requirements.
5. Data Inflexibility: Program-data interdependency and data isolation limited the flexibility of
file processing systems in responding to users' ad-hoc information requests.
6. Incompatible file formats: As the structure of files is embedded in the application programs,
the structures are dependent on the application programming language. For example, the
structure of a file generated by a COBOL program may be different from the structure of a file
generated by a 'C' program. The direct incompatibility of such files makes them difficult to
process jointly.
7. Data Security. The security of data is low in a file-based system because the data maintained
in the flat file(s) is easily accessible. For example, consider a banking system. The Customer
Transaction file has details about the total available balance of all customers. A customer wants
information about his account balance. In a file system it is difficult to give the customer access
to only his data in the file. Thus, enforcing security constraints for the entire file or for certain
data items is difficult.
8. Transactional Problems. The file-based approach does not satisfy transaction
properties such as Atomicity, Consistency, Isolation, and Durability, commonly known as the
ACID properties.
For example: Suppose, in a banking system, a transaction transfers Rs. 1000 from account A
to account B, with the initial values of A and B being Rs. 5000 and Rs. 10000 respectively. If a
system crash occurs after the withdrawal of Rs. 1000 from account A, but before the deposit of
the amount in account B, the system is left in an inconsistent state. This means that
transactions should not execute partially but wholly. This concept is known as atomicity of a
transaction (either 0% or 100% of the transaction is executed). It is difficult to achieve this
property in a file-based system.
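For contrast, the following is a minimal sketch of how a DBMS transaction keeps the transfer atomic. It uses Python's built-in sqlite3 module as a stand-in for a DBMS. The table name account, the function names, and the simulated crash are assumptions made for illustration; only the balances come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 5000), ("B", 10000)])
conn.commit()

def transfer(conn, source, target, amount, crash_midway=False):
    """Move amount from source to target as one all-or-nothing transaction."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, source))
        if crash_midway:
            # Stand-in for a failure between the withdrawal and the deposit.
            raise RuntimeError("simulated crash after the withdrawal")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, target))
        conn.commit()
    except RuntimeError:
        # Undo the partial work so that 0% of the transaction takes effect.
        conn.rollback()

def balances(conn):
    """Return the current balances as a dictionary keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM account"))

transfer(conn, "A", "B", 1000, crash_midway=True)
after_crash = balances(conn)      # both balances are unchanged

transfer(conn, "A", "B", 1000)
after_success = balances(conn)    # A is reduced and B is increased by 1000
```

In a plain file, the withdrawal would already have been written when the crash occurred, and the program would have to repair the file itself.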
9. Concurrency problems. When multiple users access the same piece of data during the same
interval of time, this is called concurrency of the system. When two or more users read the data
simultaneously there is no problem, but when they try to update a file simultaneously, it may
result in a problem.
For example:
Let us consider a scenario in which, in transaction T1, a user transfers an amount of 1000 from
account A to account B (the initial value of A is 5000 and of B is 8000). Meanwhile, another
transaction T2, which displays the sum of accounts A and B, is also executed. If both transactions
run in parallel, the result may be inconsistent, as shown below:
The above schedule results in an inconsistent view of the database: it shows Rs. 12,000 as the sum
of accounts A and B instead of Rs. 13,000. The problem occurs because the second, concurrently
running transaction T2 reads A and B at an intermediate point and computes their sum, which
yields an inconsistent value.
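The arithmetic of this schedule can be replayed step by step. The sketch below does not run real concurrent transactions; it assumes one particular interleaving, in which T2 reads after T1's withdrawal and before T1's deposit, and the variable names are illustrative.

```python
accounts = {"A": 5000, "B": 8000}

# Transaction T1, step 1: withdraw 1000 from A.
accounts["A"] -= 1000

# Transaction T2 runs here, between the two steps of T1,
# and reads both balances at this intermediate point.
sum_seen_by_t2 = accounts["A"] + accounts["B"]

# Transaction T1, step 2: deposit 1000 into B.
accounts["B"] += 1000

# The sum once T1 has completed.
sum_after_t1 = accounts["A"] + accounts["B"]
```

Here sum_seen_by_t2 is 12000 while sum_after_t1 is 13000, which matches the figures quoted above.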
10. Poor data modeling of real world. The file-based system is not able to represent
complex data and interfile relationships, which results in poor data modeling properties.
on the data. The information stored in the catalog is called meta-data, and it describes the
structure of the primary database.
Insulation between programs and data, and data abstraction
The structure of data files is stored in the DBMS catalog separately from the access programs.
We call this property program-data independence. The characteristic that allows program-data
independence and program-operation independence is called data abstraction. A DBMS provides
users with a conceptual representation of data that does not include many of the details of how
the data is stored or how the operations are implemented. Informally, a data model is a type of
data abstraction that is used to provide this conceptual representation. The data model uses
logical concepts, such as objects, their properties, and their interrelationships, that may be easier
for most users to understand than computer storage concepts. Hence, the data model hides
storage and implementation details that are not of interest to most database users.
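As an illustration of a catalog that is kept separately from the access programs, the sketch below uses SQLite through Python's sqlite3 module. SQLite is only one example of a DBMS, and the student table and its columns are hypothetical. SQLite stores schema definitions in a table named sqlite_master, and PRAGMA table_info lists the columns of a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT, class TEXT)")

# The catalog records the table definition itself as data (meta-data).
catalog_rows = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(student)")]
```

A program can look up the structure of student at run time instead of having the record layout written into its own code, which is the basis of program-data independence.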
Support of multiple views of the data
A database typically has many users, each of whom may require a different perspective or view
of the database. A view may be a subset of the database or it may contain virtual data that is
derived from the database files but is not explicitly stored. Some users may not need to be aware
of whether the data they refer to is stored or derived. A multiuser DBMS whose users have a
variety of distinct applications must provide facilities for defining multiple views.
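Both kinds of view can be sketched with SQL views, here run through Python's sqlite3 module. The student table, its columns, and the view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT, "
    "marks1 INTEGER, marks2 INTEGER)"
)
conn.execute("INSERT INTO student VALUES (1, 'Rahat', 70, 80)")

# A view that is a subset of the stored data.
conn.execute("CREATE VIEW student_names AS SELECT rollno, name FROM student")

# A view containing virtual data: total is derived on demand, not stored.
conn.execute(
    "CREATE VIEW student_totals AS "
    "SELECT rollno, marks1 + marks2 AS total FROM student"
)

names = conn.execute("SELECT * FROM student_names").fetchall()
totals = conn.execute("SELECT * FROM student_totals").fetchall()
```

A user who queries student_totals does not need to know that total is computed rather than stored.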
Sharing of data and multiuser transaction processing
A multiuser DBMS, as its name implies, must allow multiple users to access the database at the
same time. This is essential if data for multiple applications is to be integrated and maintained in
a single database. The DBMS must include concurrency control software to ensure that several
users trying to update the same data do so in a controlled manner so that the result of the updates
is correct.
Roles in the database environment
Database Administrators
In any organization where many persons use the same resources, there is a need for a chief
administrator to oversee and manage these resources. In a database environment, the primary
resource is the database itself, and the secondary resource is the DBMS and related software.
Administering these resources is the responsibility of the database administrator (DBA). The
DBA is responsible for authorizing access to the database, for coordinating and monitoring its
use, and for acquiring software and hardware resources as needed. The DBA is accountable for
problems such as breach of security or poor system response time. In large organizations, the
DBA is assisted by a staff that helps carry out these functions.
Database Designers
Database designers are responsible for identifying the data to be stored in the database and for
choosing appropriate structures to represent and store this data. These tasks are mostly
undertaken before the database is actually implemented and populated with data. It is the
responsibility of database designers to communicate with all prospective database users in order
to understand their requirements, and to come up with a design that meets these requirements. In
many cases, the designers are on the staff of the DBA and may be assigned other staff
responsibilities after the database design is completed. Database designers typically interact with
each potential group of users and develop views of the database that meet the data and
processing requirements of these groups. Each view is then analyzed and integrated with the
views of other user groups. The final database design must be capable of supporting the
requirements of all user groups.
End Users
End users are the people whose jobs require access to the database for querying, updating, and
generating reports; the database primarily exists for their use. There are several categories of end
users:
Casual end users occasionally access the database, but they may need different information
each time. They use a sophisticated database query language to specify their requests and are
typically middle- or high-level managers or other occasional browsers.
Naive or parametric end users make up a sizable portion of database end users. Their main job
function revolves around constantly querying and updating the database, using standard types of
queries and updates (called canned transactions) that have been carefully programmed and tested.
The tasks that such users perform are varied:
Bank tellers check account balances and post withdrawals and deposits.
Reservation clerks for airlines, hotels, and car rental companies check availability for a given
request and make reservations.
Clerks at receiving stations for courier mail enter package identifications via bar codes and
descriptive information through buttons to update a central database of received and in-transit
packages.
Sophisticated end users include engineers, scientists, business analysts, and others who
thoroughly familiarize themselves with the facilities of the DBMS so as to implement their
applications to meet their complex requirements.
Stand-alone users maintain personal databases by using ready-made program packages that
provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a tax
package that stores a variety of personal financial data for tax purposes.
A typical DBMS provides multiple facilities to access a database. Naive end users need to learn
very little about the facilities provided by the DBMS; they have to understand only the user
interfaces of the standard transactions designed and implemented for their use. Casual users learn
only a few facilities that they may use repeatedly. Sophisticated users try to learn most of the
DBMS facilities in order to achieve their complex requirements. Stand-alone users typically
become very proficient in using a specific software package.
System Analysts and Application Programmers (Software Engineers)
System analysts determine the requirements of end users, especially naive and parametric end
users, and develop specifications for canned transactions that meet these requirements.
Application programmers implement these specifications as programs; then they test, debug,
document, and maintain these canned transactions. Such analysts and programmers (commonly
referred to as software engineers) should be familiar with the full range of capabilities provided
by the DBMS to accomplish their tasks.
In addition to those who design, use, and administer a database, others are associated with the
design, development, and operation of the DBMS software and system environment. These
persons are typically not interested in the database itself. We call them the "workers behind the
scene," and they include the following categories.
DBMS system designers and implementers are persons who design and implement the DBMS
modules and interfaces as a software package. A DBMS is a very complex software system that
consists of many components, or modules, including modules for implementing the catalog,
processing query language, processing the interface, accessing and buffering data, controlling
concurrency, and handling data recovery and security. The DBMS must interface with other
system software, such as the operating system and compilers for various programming
languages.
Tool developers include persons who design and implement tools, the software packages that
facilitate database system design and use and that help improve performance. Tools are optional
packages that are often purchased separately. They include packages for database design,
performance monitoring, natural language or graphical interfaces, prototyping, simulation, and
test data generation. In many cases, independent software vendors develop and market these
tools.
Operators and maintenance personnel are the system administration personnel who are
responsible for the actual running and maintenance of the hardware and software environment for
the database system.
Although these categories of workers behind the scene are instrumental in making the database
system available to end users, they typically do not use the database for their own purposes.
It is clear from the above file systems that there is some common data about the student which has
to be mentioned in each application, such as Rollno, Name, Class, Phone_No, Address etc. This will
cause the problem of redundancy, which wastes storage space and makes the data difficult to
maintain, but in the case of a centralized database, data can be shared by a number of applications
and the whole college can maintain its computerized data with the following database:
It is clear in the above database that Rollno, Name, Class, Father_Name, Address, Phone_No, and
Date_of_birth, which are stored repeatedly in each application in the file system, need not be stored
repeatedly in the database, because every other application can access this information by
joining relations on the basis of a common column, i.e. Rollno. Suppose a user of the Library
system needs the Name and Address of a particular student; by joining the Library and General
Office relations on the column Rollno, he/she can easily retrieve this information.
Thus, we can say that the centralized system of a DBMS reduces the redundancy of data to a great
extent but cannot eliminate it entirely, because Rollno is still repeated in all the relations.
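The join described here can be sketched in SQL, run through Python's sqlite3 module. The relation names General_Office and Library and the column Rollno follow the text. The column Book_Title and the sample rows are assumptions, since the original database diagram is not reproduced in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE General_Office (Rollno INTEGER PRIMARY KEY, Name TEXT, Address TEXT)"
)
conn.execute("CREATE TABLE Library (Rollno INTEGER, Book_Title TEXT)")
conn.execute("INSERT INTO General_Office VALUES (5, 'Rahat', 'Amritsar')")
conn.execute("INSERT INTO Library VALUES (5, 'Database Concepts')")

# The Library relation stores only Rollno; Name and Address are fetched
# from General_Office by joining on the common column.
rows = conn.execute(
    """
    SELECT g.Name, g.Address, l.Book_Title
    FROM Library AS l
    JOIN General_Office AS g ON g.Rollno = l.Rollno
    WHERE l.Rollno = 5
    """
).fetchall()
```

Only Rollno appears in both relations, which is the residual redundancy mentioned above.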
2. Integrity can be enforced: Integrity of data means that data in database is always accurate,
such that incorrect information cannot be stored in database. In order to maintain the integrity of
data, some integrity constraints are enforced on the database. A DBMS should provide
capabilities for defining and enforcing the constraints.
For example: Let us consider the case of a college database and suppose that the college has only
BTech, MTech, MSc, BCA, BBA and BCOM classes. If a user enters the class MCA, then
this incorrect information must not be stored in the database, and the user must be informed that
this is an invalid data entry. To enforce this, an integrity constraint must be applied to the class
attribute of the student entity. In a file system, however, this constraint must be enforced in each
application separately (because all applications have a class field).
In the case of a DBMS, this integrity constraint is applied only once, on the class field of the General
Office table (because the class field appears only once in the whole database), and all other applications
get the class information about the student from the General Office table, so the integrity
constraint applies to the whole database. So, we can conclude that integrity constraints can be
enforced more easily in a centralized DBMS than in a file system.
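One way to declare such a constraint once, in the database itself, is a CHECK constraint. The sketch below uses SQLite through Python's sqlite3 module. The table layout and the sample students are assumptions; the list of permitted classes is the one given above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE General_Office (
        Rollno INTEGER PRIMARY KEY,
        Name   TEXT,
        Class  TEXT CHECK (Class IN ('BTech', 'MTech', 'MSc', 'BCA', 'BBA', 'BCOM'))
    )
    """
)
conn.execute("INSERT INTO General_Office VALUES (1, 'Rahat', 'BTech')")

# An attempt to store the class MCA is rejected by the DBMS itself.
try:
    conn.execute("INSERT INTO General_Office VALUES (2, 'Asha', 'MCA')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored_classes = [row[0] for row in conn.execute("SELECT Class FROM General_Office")]
```

Every application that inserts into General_Office is subject to the same rule without containing any checking code of its own.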
3. Inconsistency can be avoided: When the same data is duplicated and changes made at
one site are not propagated to the other site, the two entries regarding the same data will not
agree. At such times the data is said to be inconsistent.
So, if the redundancy is removed, the chance of having inconsistent data is also removed.
Let us again consider the college system and suppose that the General_Office file
indicates that Roll_Number 5 lives in Amritsar but the library file indicates that
Roll_Number 5 lives in Jalandhar. This is a state in which the two entries for the same
object do not agree with each other (that is, one is updated and the other is not). At such a time the
database is said to be inconsistent.
An inconsistent database is capable of supplying incorrect or conflicting information. So there
should be no inconsistency in the database. Inconsistency can be avoided much more effectively
in a centralized system than in a file system.
Let us consider again the example of the college system and suppose that Roll Number 5 has moved
from Amritsar to Jalandhar. The address information of Roll Number 5 must then be updated wherever
the roll number and address occur in the system. In a file system, the information must be
updated separately in each application; if we make the update in only three places and forget to
make it in a fourth application, then the whole system shows inconsistent results about
Roll Number 5.
In the case of a DBMS, roll number and address occur together only once, in the General_Office
table. So only a single update is needed; every other application retrieves the address information
from General_Office, which has been updated, so all applications get the current
information from a single update operation. This single update is in effect propagated
to the whole database and to all other applications automatically; this property is called Propagation
of Update.
We can say that redundancy of data greatly affects the consistency of data. If there is less
redundancy, it is easier to keep data consistent. Thus, a DBMS can avoid inconsistency to a great
extent.
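Propagation of update can be sketched as follows, again using SQLite through Python's sqlite3 module. General_Office, Library, Rollno, and the two cities come from the text. The Hostel relation, the other columns, and the helper address_seen_by are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE General_Office (Rollno INTEGER PRIMARY KEY, Address TEXT)")
conn.execute("CREATE TABLE Library (Rollno INTEGER, Book_Title TEXT)")
conn.execute("CREATE TABLE Hostel (Rollno INTEGER, Room_No INTEGER)")
conn.execute("INSERT INTO General_Office VALUES (5, 'Amritsar')")
conn.execute("INSERT INTO Library VALUES (5, 'Database Concepts')")
conn.execute("INSERT INTO Hostel VALUES (5, 12)")

def address_seen_by(conn, table):
    """Read a student's address the way an application would: through General_Office."""
    # The table name is interpolated only because it is a fixed, trusted value here.
    query = (
        f"SELECT g.Address FROM {table} AS t "
        "JOIN General_Office AS g ON g.Rollno = t.Rollno "
        "WHERE t.Rollno = 5"
    )
    return conn.execute(query).fetchone()[0]

# The address is stored in one place, so one UPDATE is enough.
conn.execute("UPDATE General_Office SET Address = 'Jalandhar' WHERE Rollno = 5")

library_address = address_seen_by(conn, "Library")
hostel_address = address_seen_by(conn, "Hostel")
```

Both applications now read Jalandhar, although neither of their own relations was touched.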
4. Data can be shared: As explained earlier, the data about Name, Class, Father_Name etc. in
General_Office is shared by multiple applications in a centralized DBMS, unlike in a file
system, so new applications can be developed to operate against the same stored data. The
applications may be developed without having to create any new stored files.
5. Standards can be enforced: Since a DBMS is a central system, standards can be enforced
easily, whether at the company, department, national, or international level.
Standardized data is very helpful during migration or interchange of data. A file system is an
independent system, so standards cannot be easily enforced on multiple independent applications.
6. Restricting unauthorized access: When multiple users share a database, it is likely that some
users will not be authorized to access all information in the database. For example, account office
data is often considered confidential, and hence only authorized persons are allowed to access
such data. In addition, some users may be permitted only to retrieve data, whereas others are
allowed both to retrieve and to update. Hence, the type of access operation (retrieval or update)
must also be controlled. Typically, users or user groups are given account numbers protected by
passwords, which they can use to gain access to the database. A DBMS should provide a security
and authorization subsystem, which the DBA uses to create accounts and to specify account
restrictions. The DBMS should then enforce these restrictions automatically.
7. Solving Enterprise Requirements Rather than Individual Requirements: Since many types of users
with varying levels of technical knowledge use a database, a DBMS should provide a variety of
user interfaces. The overall requirements of the enterprise are more important than the individual
user requirements. So, the DBA can structure the database system to provide an overall service
that is "best for the enterprise".
For example: A representation can be chosen for the data in storage that gives fast access for the
most important application at the cost of poorer performance in some other application. The
file system, by contrast, favors individual requirements over enterprise requirements.
8. Providing Backup and Recovery: A DBMS must provide facilities for recovering from
hardware or software failures. The backup and recovery subsystem of the DBMS is responsible
for recovery. For example, if the computer system fails in the middle of a complex update
program, the recovery subsystem is responsible for making sure that the database is restored to
the state it was in before the program started executing.
9. Cost of developing and maintaining system is lower: It is much easier to respond to
unanticipated requests when data is centralized in a database than when it is stored in a
conventional file system. Although the initial cost of setting up a database can be large, the
cost of developing and maintaining application programs is far lower than for a similar service
using conventional systems. The productivity of programmers can be higher when using nonprocedural languages that have been developed with DBMSs than when using procedural languages.
10. Data Model can be developed: The centralized system is able to represent complex data
and interfile relationships, which results in better data modeling properties. The data modeling
properties of the relational model are based on entities and their relationships, which are discussed in
detail in chapter 4 of the book.
11. Concurrency Control: DBMSs provide mechanisms that allow multiple users to access
data concurrently.
Disadvantages of DBMS
The disadvantages of the database approach are summarized as follows:
1. Complexity: The provision of the functionality that is expected of a good DBMS makes the
DBMS an extremely complex piece of software. Database designers, developers, database
administrators and end-users must understand this functionality to take full advantage of it.
Failure to understand the system can lead to bad design decisions, which can have serious
consequences for an organization.
2. Size: The complexity and breadth of functionality make the DBMS an extremely large piece
of software, occupying many megabytes of disk space and requiring substantial amounts
of memory to run efficiently.
3. Performance: Typically, a file-based system is written for a specific application, such as
invoicing. As a result, performance is generally very good. However, the DBMS is written to be
more general, to cater for many applications rather than just one. The effect is that some
applications may not run as fast as they used to.
4. Higher impact of a failure: The centralization of resources increases the vulnerability of the
system. Since all users and applications rely on the availability of the DBMS, the failure of any
component can bring operations to a halt.
5. Cost of DBMS: The cost of DBMS varies significantly, depending on the environment and
functionality provided. There is also the recurrent annual maintenance cost.
6. Additional Hardware costs: The disk storage requirements for the DBMS and the database
may necessitate the purchase of additional storage space. Furthermore, to achieve the required
performance it may be necessary to purchase a larger machine, perhaps even a machine
dedicated to running the DBMS. The procurement of additional hardware results in further
expenditure.
7. Cost of Conversion: In some situations, the cost of the DBMS and extra hardware may be
insignificant compared with the cost of converting existing applications to run on the new DBMS
and hardware. This cost also includes the cost of training staff to use these new systems and
possibly the employment of specialist staff to help with conversion and running of the system.
This cost is one of the main reasons why some organizations feel tied to their current systems
and cannot switch to modern database technology.
Database Architecture
DBMSs do not all conform to the same architecture.
Figure: The external, conceptual, and internal levels
A data definition language (DDL) provides for the definition or description of database
objects.
Each user sees the data in terms of an external view: Defined by an external schema, consisting
basically of descriptions of each of the various types of external record in that external view, and
also a definition of the mapping between the external schema and the underlying conceptual
schema.
Conceptual View
It is in general a view of the data as it actually is, that is, it is a 'model' of the 'real world'.
Internal View
The internal view is a low-level representation of the entire database consisting of multiple
occurrences of multiple types of internal (stored) records.
It is however at one remove from the physical level since it does not deal in terms of physical
records or blocks nor with any device specific constraints such as cylinder or track sizes. Details
of mapping to physical storage are highly implementation specific and are not expressed in the
three-level architecture.
The internal view described by the internal schema:
An external/conceptual mapping:
A change to the storage structure definition means that the conceptual/internal mapping
must be changed accordingly, so that the conceptual schema may remain invariant,
achieving physical data independence.
A change to the conceptual definition means that the conceptual/external mapping must
be changed accordingly, so that the external schema may remain invariant, achieving
logical data independence.
Database languages
Once the design of a database is completed and a DBMS is chosen to implement the database,
the first order of the day is to specify conceptual and internal schemas for the database and any
mappings between the two. In many DBMSs where no strict separation of levels is maintained,
one language, called the data definition language (DDL), is used by the DBA and by database
designers to define both schemas. The DBMS will have a DDL compiler whose function is to
process DDL statements in order to identify descriptions of the schema constructs and to store
the schema description in the DBMS catalog. In DBMSs where a clear separation is maintained
between the conceptual and internal levels, the DDL is used to specify the conceptual schema
only. Another language, the storage definition language (SDL), is used to specify the internal
schema. The mappings between the two schemas may be specified in either one of these
languages. For a true three-schema architecture, we would need a third language, the view
definition language (VDL), to specify user views and their mappings to the conceptual schema,
but in most DBMSs the DDL is used to define both conceptual and external schemas. Once the
database schemas are compiled and the database is populated with data, users must have some
means to manipulate the database. Typical manipulations include retrieval, insertion, deletion,
and modification of the data. The DBMS provides a set of operations or a language called the
data manipulation language (DML) for these purposes. In current DBMSs, the preceding types of
languages are usually not considered distinct languages; rather, a comprehensive integrated
language is used that includes constructs for conceptual schema definition, view definition and
data manipulation. Storage definition is typically kept separate, since it is used for defining
physical storage structures to fine tune the performance of the database system, which is usually
done by the DBA staff. A typical example of a comprehensive database language is the SQL
relational database language which represents a combination of DDL, VDL, and DML, as well as
statements for constraint specification, schema evolution, and other features. The SDL was a
component in early versions of SQL but has been removed from the language to keep it at the
conceptual and external levels only.
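The three roles that SQL combines can be seen in a short sketch, run here through Python's sqlite3 module. The employee table, the view employee_directory, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL role: define the conceptual schema.
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)

# VDL role: define a user view and its mapping to the conceptual schema.
conn.execute("CREATE VIEW employee_directory AS SELECT emp_id, name FROM employee")

# DML role: insertion, modification, deletion, and retrieval.
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 40000)")
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 35000)")
conn.execute("UPDATE employee SET salary = 42000 WHERE emp_id = 1")
conn.execute("DELETE FROM employee WHERE emp_id = 2")

directory = conn.execute("SELECT * FROM employee_directory").fetchall()
salaries = conn.execute("SELECT emp_id, salary FROM employee").fetchall()
```

None of these statements says how the rows are laid out on disk, in line with the remark above that SQL stays at the conceptual and external levels.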
Conceptual modelling
The Conceptual Design phase takes the high-level data model and converts it into a conceptual
schema, which is specific to a particular DBMS class (e.g. relational). For a relational system,
such as Oracle, an appropriate conceptual schema would be relations.
Finally, in the Physical Design phase the conceptual schema is converted into database internal
structures. This is specific to a particular DBMS product.
Basics
Entity Relationship (ER) modelling
is a design tool
Entities
An entity is any object in the system that we want to model and store information about
Groups of the same type of objects are called entity types or entity sets
Figure: Entities
There are two types of entities: weak and strong entity types.
Attribute
Each entity can have attribute values different from those of any other entity.
Attributes can be
simple or composite
single-valued or multi-valued
Note that entity types can have a large number of attributes... If all are shown then the
diagrams would be confusing. Only show an attribute if it adds information to the ER
diagram, or clarifies a point.
Figure : Attributes
Keys
A key is a data item that allows us to uniquely identify individual occurrences of an entity
type.
An entity type may have one or more possible candidate keys; the one which is selected
is known as the primary key.
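A minimal sketch of the difference between candidate keys and the primary key, using Python's sqlite3 module. The employee table and its ni_no attribute are hypothetical; the assumption is that both emp_no and ni_no could identify an employee, and emp_no is the one selected as primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# emp_no and ni_no are both candidate keys (assumed for this example).
# emp_no is selected as the primary key; ni_no is still kept unique.
conn.execute(
    "CREATE TABLE employee ("
    " emp_no INTEGER PRIMARY KEY,"
    " ni_no TEXT NOT NULL UNIQUE,"
    " name TEXT)"
)
conn.execute("INSERT INTO employee VALUES (1, 'AB123', 'Gordon')")


def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


dup_primary = try_insert((1, "CD456", "Andrew"))    # repeats the primary key
dup_candidate = try_insert((2, "AB123", "Andrew"))  # repeats the other candidate key
fresh = try_insert((2, "CD456", "Andrew"))          # unique on both keys
print(dup_primary, dup_candidate, fresh)
```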
Relationships
A relationship is an association of entities where the association includes one entity from
each participating entity type.
In the original Chen notation, the relationship is placed inside a diamond, e.g. managers
manage employees:
For this module, we will use an alternative notation, where the relationship is a label on
the line. The meaning is identical
A recursive relationship is one where the same entity type participates more than once in different roles.
In the example above we are saying that employees are managed by employees.
If we wanted more information about who manages whom, we could introduce a second
entity type called manager.
Degree of a Relationship
It is also possible to have entities associated through two or more distinct relationships.
This can result in the loss of some information - It is no longer clear which sales assistant
sold a customer a particular product.
Try replacing the ternary relationship with an entity type and a set of binary relationships.
Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as
a noun.
So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product.
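The replacement of a ternary relationship by an entity type can be sketched as follows. The relationship verb "sells" becomes the noun "sale", and the new sale table is linked to each of the three original entity types by a binary relationship. All table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_assistant (sa_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (prod_no INTEGER PRIMARY KEY, descr TEXT);

-- New entity type named after the relationship verb rewritten as a noun.
CREATE TABLE sale (
    sale_no INTEGER PRIMARY KEY,
    sa_no   INTEGER NOT NULL REFERENCES sales_assistant(sa_no),
    cust_no INTEGER NOT NULL REFERENCES customer(cust_no),
    prod_no INTEGER NOT NULL REFERENCES product(prod_no)
);

INSERT INTO sales_assistant VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO customer VALUES (1, 'Carol');
INSERT INTO product VALUES (1, 'Kettle'), (2, 'Toaster');
INSERT INTO sale VALUES (1, 1, 1, 1), (2, 2, 1, 2);
""")

# Each sale row ties one assistant, one customer, and one product together,
# so "who sold Carol the toaster?" has a definite answer.
seller = conn.execute("""
    SELECT sa.name
    FROM sale s
    JOIN sales_assistant sa ON sa.sa_no = s.sa_no
    JOIN customer c ON c.cust_no = s.cust_no
    JOIN product p ON p.prod_no = s.prod_no
    WHERE c.name = 'Carol' AND p.descr = 'Toaster'
""").fetchall()
print(seller)
```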
Cardinality
This is described by the cardinality of the relationship, for which there are four possible
categories.
A one to one relationship - a man can only marry one woman, and a woman can only
marry one man, so it is a one to one (1:1) relationship
A one to many relationship - one manager manages many employees, but each employee
only has one manager, so it is a one to many (1:n) relationship
A many to one relationship - many students study one course. They do not study more
than one course, so it is a many to one (m:1) relationship
A many to many relationship - One lecturer teaches many students and a student is taught
by many lecturers, so it is a many to many (m:n) relationship
an entity at one end of the relationship must be related to an entity at the other end.
But a course can exist before any students have enrolled. Thus the relationship `course
is_studied_by student' is optional.
To show optionality, put a circle or `O' at the `optional end' of the relationship.
As the optional relationship is `course is_studied_by student', and the optional part of this
is the student, then the `O' goes at the student end of the relationship connection.
It is important to know the optionality because you must ensure that whenever you create
a new entity it has the required mandatory links.
Entity Sets
Sometimes it is useful to try out various examples of entities from an ER model. One reason for
this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set
diagram' to show entity examples graphically. Consider the example of `course is_studied_by
student'.
Confirming Correctness
Go back to the requirements specification and check to see if they are allowed.
This allows you to show the cardinality and optionality of the relationship
If the answer was `one or more', then the relationship would be `mandatory'.
The answer `one' means that the cardinality of this relationship is 1, and is
`mandatory'
If the answer had been `zero or one', then the cardinality of the relationship would
have been 1, and be `optional'.
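The rules above can be written down as a small helper. This is only a sketch: the function name classify and the convention of passing the smallest and largest number of related entities (with None meaning "no upper limit") are assumptions made for this example.

```python
def classify(min_count, max_count):
    """Classify one end of a relationship from the answer to
    'how many entities can be related here?'.

    min_count: smallest allowed number of related entities (0 or 1).
    max_count: largest allowed number, or None for 'many'.
    Returns a (cardinality, optionality) pair.
    """
    cardinality = "1" if max_count == 1 else "many"
    optionality = "optional" if min_count == 0 else "mandatory"
    return cardinality, optionality


one_or_more = classify(1, None)  # answer 'one or more'
exactly_one = classify(1, 1)     # answer 'one'
zero_or_one = classify(0, 1)     # answer 'zero or one'
print(one_or_more, exactly_one, zero_or_one)
```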
Redundant relationships
Some ER diagrams end up with a relationship loop.
Given three entities A, B, C, where there are relationships A-B, B-C, and C-A, check if it is
possible to navigate between A and C via B. If it is possible, then A-C is a redundant
relationship.
Always check carefully for ways to simplify your ER diagram. It makes it easier to read
the remaining information.
Consider entities `customer' (customer details), `address' (the address of a customer) and
`distance' (distance from the company to the customer address).
e.g. if modelling a library, the entity types might be books, borrowers, etc.
3. List the attributes of each entity (all properties to describe the entity which are relevant to
the application).
if so keep them as attributes and cross them off the entity list.
ER modelling is an iterative process, so draw several versions, refining each one until you are
happy with it. Note that there is no one right answer to the problem, but some solutions are better
than others!
Entity Relationship Modelling - 2
Country Bus Company
A Country Bus Company owns a number of buses. Each bus is allocated to a particular route,
although some routes may have several buses. Each route passes through a number of towns.
One or more drivers are allocated to each stage of a route, which corresponds to a journey
through some or all of the towns on a route. Some of the towns have a garage where buses are
kept. Each bus is identified by its registration number, and buses can carry different
numbers of passengers, since the vehicles vary in size and can be single or double-decked. Each
route is identified by a route number and information is available on the average number of
passengers carried per day for each route. Drivers have an employee number, name, address, and
sometimes a telephone number.
Entities
Bus - The company owns buses and will hold information about them.
Town - Buses pass through towns, so we need to know about them.
Garage - A garage houses buses, and we need to know where they are.
Relationships
A garage keeps buses and each bus has one `home' garage
Bus (reg-no,make,size,deck,no-pass)
Route (route-no,avg-pass)
Driver (emp-no,name,address,tel-no)
Town (name)
Stage (stage-no)
Garage (name,address)
Chasm traps
A chasm trap occurs when a model suggests the existence of a relationship between entity types,
but the pathway does not exist between certain entity occurrences.
It occurs where there is a relationship with partial participation, which forms part of the pathway
between entities that are related.
A single branch is allocated many staff who oversee the management of properties for
rent. Not all staff oversee property and not all property is managed by a member of staff.
The partial participation of Staff and Property in the oversees relation means that some
properties cannot be associated with a branch office through a member of staff.
We need to add the missing relationship which is called `has' between the Branch and the
Property entities.
You need to therefore be careful when you remove relationships which you consider to be
redundant.
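The chasm trap in the Branch, Staff, Property example can be reproduced with a few rows of data. The sketch below uses Python's sqlite3 module; the column names and sample values are assumptions. The staff_no column in property is nullable (the partial `oversees' relationship), while branch_no in property stands for the added `has' relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch (branch_no INTEGER PRIMARY KEY);
CREATE TABLE staff (
    staff_no  INTEGER PRIMARY KEY,
    branch_no INTEGER NOT NULL REFERENCES branch(branch_no)
);
CREATE TABLE property (
    prop_no   INTEGER PRIMARY KEY,
    staff_no  INTEGER REFERENCES staff(staff_no),            -- optional 'oversees'
    branch_no INTEGER NOT NULL REFERENCES branch(branch_no)  -- added 'has'
);
INSERT INTO branch VALUES (1);
INSERT INTO staff VALUES (10, 1);
INSERT INTO property VALUES (100, 10, 1), (101, NULL, 1);
""")

# Pathway Branch -> Staff -> Property: property 101 has no overseer,
# so it falls into the chasm and is not found.
via_staff = [row[0] for row in conn.execute("""
    SELECT p.prop_no
    FROM property p JOIN staff s ON s.staff_no = p.staff_no
    WHERE s.branch_no = 1
    ORDER BY p.prop_no
""")]

# Direct 'has' relationship: every property of the branch is found.
via_has = [row[0] for row in conn.execute(
    "SELECT prop_no FROM property WHERE branch_no = 1 ORDER BY prop_no"
)]
print(via_staff, via_has)
```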
Specialisation
Generalisation
Categorisation
Aggregation
Superclass - an entity type that includes distinct subclasses that need to be represented
in a data model.
Subclass - an entity type that has a distinct role and is also a member of a superclass.
Staff(staff_no,name,address,dob)
Manager(bonus)
Secretary(wp_skills)
Sales_personnel(sales_area, car_allowance)
Here we have shown that the manages relationship is only applicable to the Manager
subclass, whereas the works_for relationship is applicable to all staff.
Generalisation
Generalisation is the process of minimising the differences between entities by identifying
common features.
This is the identification of a generalised superclass from the original subclasses. This is the
process of identifying the common attributes and relationships.
For instance, taking:
car(regno,colour,make,model,numSeats)
motorbike(regno,colour,make,model,hasWindshield)
And forming:
vehicle(regno,colour,make,model,numSeats,hasWindshield)
In this case vehicle has numSeats which would be NULL if the vehicle was a motorbike, and has
hasWindshield which would be NULL if it was a car.
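A small sketch of the generalised vehicle relation, showing where the NULLs appear. It uses Python's sqlite3 module; the sample rows are invented, and storing hasWindshield as 0/1 is an assumption, since the notes do not give a data type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicle ("
    " regno TEXT PRIMARY KEY, colour TEXT, make TEXT, model TEXT,"
    " numSeats INTEGER,"       # only meaningful for cars
    " hasWindshield INTEGER)"  # only meaningful for motorbikes (0 or 1)
)

# A car: hasWindshield does not apply, so it is NULL.
conn.execute("INSERT INTO vehicle VALUES ('CAR1', 'red', 'Ford', 'Focus', 5, NULL)")
# A motorbike: numSeats does not apply, so it is NULL.
conn.execute("INSERT INTO vehicle VALUES ('BIKE1', 'black', 'Honda', 'CB500', NULL, 1)")

rows_without_seats = conn.execute(
    "SELECT regno FROM vehicle WHERE numSeats IS NULL"
).fetchall()
rows_without_windshield = conn.execute(
    "SELECT regno FROM vehicle WHERE hasWindshield IS NULL"
).fetchall()
print(rows_without_seats, rows_without_windshield)
```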
Mapping ER Models into Relations
What is a relation?
A relation is a table that holds the data we are interested in. It is two-dimensional and has rows
and columns.
Each entity type in the ER model is mapped into a relation.
Figure : a relation
Roughly, each foreign key represents a relationship between two entity types.
It will generally have a foreign key from each table that it is related to.
The choice of which entity type subsumes the other depends on which is the most
important entity type (more attributes, better key, semantic nature of them).
The result of this amalgamation is that all the attributes of the `swallowed up' entity
become attributes of the more important entity.
The primary key of the new combined entity is usually the same as that of the original
more important entity type.
the two entity types represent different entities in the `real world'.
If not combined...
If the two entity types are kept separate then the association between them must be represented
by a foreign key.
The primary key of one entity type becomes the foreign key in the other.
It does not matter which way around it is done but you should not have a foreign key in
each entity.
Example
Each member of staff must have one contract and each contract must have one member of
staff associated with it.
or
Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)
Mandatory <->Optional
The entity type of the optional end may be subsumed into the mandatory end as in the previous
example.
It is better NOT to subsume the mandatory end into the optional end as this will create null
entries.
Map the foreign key into Staff - the key is null for staff without a contract.
Map the foreign key into Contract - emp_no is mandatory thus never null.
Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)
Example
Consider this example:
Contract 11, from 1st Jan 2001 to 10th Jan 2001, lecturer, on 2.00 a year.
Foreign key in Staff:
Contract Table:
Cont_no  Start         End            Position  Salary
11       1st Jan 2001  10th Jan 2001  Lecturer  2.00
Staff Table:
Empno  Name    Contract No
10     Gordon  11
11     Andrew  NULL
However, Foreign key in Contract:
Contract Table:
Cont_no  Start         End            Position  Salary  Empno
11       1st Jan 2001  10th Jan 2001  Lecturer  2.00    10
Staff Table:
Empno  Name
10     Gordon
11     Andrew
As you can see, both ways store the same information, but the second way has no NULLs.
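The two placements of the foreign key can be compared directly by counting NULLs. The sketch below builds both designs in one in-memory SQLite database; the table names staff1/contract1 and staff2/contract2 are invented so that the two designs can sit side by side, and only a subset of the contract attributes is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Design 1: foreign key in Staff (NULL for staff without a contract).
CREATE TABLE contract1 (cont_no INTEGER PRIMARY KEY, position TEXT);
CREATE TABLE staff1 (
    emp_no  INTEGER PRIMARY KEY,
    name    TEXT,
    cont_no INTEGER UNIQUE REFERENCES contract1(cont_no)
);
INSERT INTO contract1 VALUES (11, 'Lecturer');
INSERT INTO staff1 VALUES (10, 'Gordon', 11), (11, 'Andrew', NULL);

-- Design 2: foreign key in Contract (emp_no is mandatory, never NULL).
CREATE TABLE staff2 (emp_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contract2 (
    cont_no  INTEGER PRIMARY KEY,
    position TEXT,
    emp_no   INTEGER NOT NULL UNIQUE REFERENCES staff2(emp_no)
);
INSERT INTO staff2 VALUES (10, 'Gordon'), (11, 'Andrew');
INSERT INTO contract2 VALUES (11, 'Lecturer', 10);
""")

nulls_design1 = conn.execute(
    "SELECT COUNT(*) FROM staff1 WHERE cont_no IS NULL"
).fetchone()[0]
nulls_design2 = conn.execute(
    "SELECT COUNT(*) FROM contract2 WHERE emp_no IS NULL"
).fetchone()[0]
print(nulls_design1, nulls_design2)
```

The UNIQUE constraint on each foreign key is what keeps the relationship 1:1 rather than 1:m.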
Mandatory <->Optional - Subsume?
The reasons for not subsuming are the same as before with the following additional reason.
very few of the entities from the mandatory end are involved in the relationship. This
could cause a lot of wasted space with many blank or null entries.
If only a few lecturers manage courses and Course is subsumed into Lecturer then there
would be many null entries in the table.
Lecturer(lect_no, l_name)
Course(cno, c_name, type, yr_vetted, external,lect_no)
Summary...
So for 1:1 optional relationships, take the primary key from the `mandatory end' and add it to the
`optional end' as a foreign key.
So, given entity types A and B, where A <->B is a relationship where the A end is optional, the
result would be:
A (primary key,attribute,...,foreign key to B)
B (primary key,attribute,...)
Optional at both ends...
Such examples cannot be amalgamated as you could not select a primary key. Instead, one
foreign key is used as before.
If emp_no is used then all the cars which are not being leased will not have a key.
Similarly, if the reg_no is used, all the staff not leasing a car will not have a key.
To map 1:m relationships, the primary key on the `one side' of the relationship is added to the
`many side' as a foreign key.
For example, the 1:m relationship `course-student':
Course(course_no, c_name)
Student(matric_no, st_name, dob)
becomes
Course(course_no, c_name)
Student(matric_no, st_name, dob, course_no)
If an entity type participates in several 1:m relationships, then you apply the rule to each
relationship, and add foreign keys as appropriate.
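A minimal sketch of the 1:m rule for `course-student', using Python's sqlite3 module. The sample rows are invented, and declaring course_no as NOT NULL assumes that every student must study a course. SQLite only enforces foreign keys when the foreign_keys pragma is switched on, which is why it appears here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE course (course_no INTEGER PRIMARY KEY, c_name TEXT);

-- The primary key of the 'one side' (course) is added to the 'many side'.
CREATE TABLE student (
    matric_no INTEGER PRIMARY KEY,
    st_name   TEXT,
    dob       TEXT,
    course_no INTEGER NOT NULL REFERENCES course(course_no)
);
INSERT INTO course VALUES (1, 'Databases');
INSERT INTO student VALUES (100, 'Ann', '2000-01-01', 1),
                           (101, 'Bob', '2000-02-02', 1);
""")

# Many students can reference the same course.
students_on_course = conn.execute(
    "SELECT COUNT(*) FROM student WHERE course_no = 1"
).fetchone()[0]

# A student pointing at a course that does not exist is rejected.
try:
    conn.execute("INSERT INTO student VALUES (102, 'Cat', '2000-03-03', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(students_on_course, rejected)
```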
A new relation is produced which contains the primary keys from both sides of the
relationship
Thus
becomes