SS3 TERM 1

INDEXES

INDEX:- An index can be considered as a copy of a database table that has been reduced to certain fields. The reduced copy is always kept in sorted form; sorting provides faster access to the data records of the table (e.g. using binary search). An index also contains a pointer to the corresponding record of the actual table, so that fields not contained in the index can also be read.

INDEXING:- This refers to the mechanisms used to optimise certain accesses to data (records) managed in files; e.g. the author catalogue in a library is a type of index.
An index can be viewed as a collection of data entries, together with an efficient way to locate all data entries with a given search key value k. Each data entry is denoted k* and contains enough information to retrieve (one or more) data records with search key value k.

SEARCH KEY:- This is an attribute or combination of attributes used to look up records in a file.

AN INDEX FILE:- This consists of records, called index entries, of the format:

    search-key value | pointer to a block in the data file

Index files are typically much smaller than the original file because only the values of the search key and the pointers are stored.

A DATABASE INDEX:- This is a data structure that improves the speed of data retrieval operations on a database table, at the cost of slower writes and additional storage space. An index can be created using one or more columns of a database table.
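To make this concrete, here is a minimal sketch (plain Python, not a real DBMS; the table contents and field names are invented for illustration) of an index as a sorted list of (search-key value, record-id) pairs, searched with binary search:

import bisect

# A tiny "table": record id -> (name, age, sal).
table = {0: ("Smith", 44, 3000), 1: ("James", 40, 6003), 2: ("Tims", 44, 5004)}

# The index on name: a sorted, reduced copy holding (key, rid) pairs.
index = sorted((name, rid) for rid, (name, _, _) in table.items())
keys = [k for k, _ in index]

def lookup(key):
    i = bisect.bisect_left(keys, key)     # binary search on the sorted keys
    return [table[rid] for k, rid in index[i:] if k == key]

print(lookup("Tims"))                     # [('Tims', 44, 5004)]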

How are data entries organized in order to support efficient retrieval of data entries with a given search
key value?

1. One way of organizing data entries is to hash data entries on the search key. In this approach we treat
the collection of data entries as a file of records hashed on the search key.

[Figure: two hash-based collections of data entries. On the left, a file of employee records (name, age, sal) hashed on age using a hash function h1; on the right, a file of (sal, rid) pairs hashed on sal using h2. Each record is placed in the bucket given by h(search-key value), e.g. h(44) = 00 and h(25) = 01 for age.]

NOTE:- In the example above, the hash function h converts the search key value to its binary representation and uses the two least significant bits as the bucket identifier.
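As a small sketch of this bucket computation (assuming integer search-key values, as in the figure):

def h(value):
    return value & 0b11      # keep the two least significant bits as the bucket id

records = [("Smith", 44, 3000), ("James", 40, 6003), ("Asry", 25, 3000)]
buckets = {b: [] for b in range(4)}
for name, age, sal in records:
    buckets[h(age)].append((name, age, sal))   # file hashed on age

print(h(44))   # 44 = 0b101100 -> bucket 00
print(h(25))   # 25 = 0b11001  -> bucket 01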

2. Another way of organising data entries is to build a data structure that directs a search for data entries. The figure below is an example:

[Figure: a clustered tree index using Alternative (2) — index entries guide the search; data entries are stored at the leaf level of the index file, in sorted order; the data records are stored in the data file.]


ALTERNATIVES TO DATA ENTRIES IN AN INDEX
A data entry k* allows the retrieval of one or more data records with key value k. There are 3 main alternatives, sketched in code below:
 A data entry k* is an actual data record with search key value k.
 A data entry is a (k, rid) pair, where rid is the record id of a data record with search key value k.
 A data entry is a (k, rid-list) pair, where rid-list is a list of record ids of data records with search key value k.
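A minimal sketch of the three alternatives (the record, rid values and key are invented for illustration):

record = ("Smith", 44, 3000)        # an actual data record; key = age = 44
rid = 7                             # hypothetical record id of that record

entry_alt1 = record                 # Alternative (1): the data record itself
entry_alt2 = (44, rid)              # Alternative (2): a (k, rid) pair
entry_alt3 = (44, [rid, 12, 31])    # Alternative (3): a (k, rid-list) pair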
TYPES OF INDEXES
The following types of index affect the efficiency of searches that use an index:
Clustered index
Non-clustered index
Dense index
Sparse index
Primary index
Secondary index
Clustered index:- A file is said to be clustered on an index when the ordering of its data records is the same as, or close to, the ordering of the data entries in the index. For example, an index that uses Alternative (1) is clustered. An index that uses Alternative (2) or (3) can be a clustered index only if the data records are sorted on the search key.
Indexes that maintain data entries in sorted order by search key use a collection of index entries, organised into a tree structure, to guide searches for data entries, which are stored at the leaf level of the tree in sorted order. The diagram below is an example of a clustered tree index using Alt (2).
[Figure: a clustered tree index using Alternative (2) — index entries guide the search; data entries are at the leaf level, in sorted order; data records are in the data file.]


Non-clustered index:- This is a type of index in which the logical order of the index does not match the physical stored order of the rows on disk; the file is organised so that the ordering of the data records is not the same as, or close to, the ordering of the data entries in the index.
[Figure: a non-clustered tree index using Alternative (2) — the order of the data entries at the leaf level differs from the order of the data records in the data file.]

Differences between clustered and non-clustered index data structures

1). Clustered: the ordering of the data records is the same as, or close to, the ordering of the data entries in the index. Non-clustered: the logical order of the index does not match the physical stored order of the rows on disk.
2). Clustered: the leaf nodes contain the data pages, in sorted order. Non-clustered: the leaf level does not contain the data pages but contains index rows.
3). Clustered: reorders the way records in the table are physically stored. Non-clustered: does not reorder the way records in the table are stored.
4). Clustered: no ORDER BY statement is needed. Non-clustered: an ORDER BY statement is needed.

Dense index:- An index is said to be dense if it contains at least one data entry for every search key value that appears in a record in the indexed file.
Sparse index:- A sparse index contains one entry for each page of records in the data file.

Note:- Alternative (1) always gives a dense index; Alt (2) can be used for either a dense or a sparse index; Alt (3) can only be used to build a dense index. A sketch of the two kinds follows.
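A minimal sketch of the difference, assuming a hypothetical page size of 3 records:

# Data file: records sorted on name, grouped into pages.
records = sorted([("Abby", 25), ("Basu", 33), ("Bris", 30), ("Cass", 50),
                  ("Dan", 22), ("Joes", 40), ("Sam", 44), ("Tony", 44)])
PAGE_SIZE = 3
pages = [records[i:i + PAGE_SIZE] for i in range(0, len(records), PAGE_SIZE)]

# Sparse index on name: one entry per page (the first name on the page).
sparse = [(page[0][0], p) for p, page in enumerate(pages)]

# Dense index on age: one entry per record, sorted on age.
dense = sorted((age, (p, s)) for p, page in enumerate(pages)
               for s, (name, age) in enumerate(page))

print(sparse)   # [('Abby', 0), ('Cass', 1), ('Sam', 2)]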

EXAMPLE:- The example below illustrates both sparse and dense indexes. A data file of records with three fields (name, age and sal) is shown with two simple indexes on it, both of which use Alt (2) for the data entry format. The first index is a sparse, clustered index on name, with one data entry per page of data records.
Note:- Note how the order of data entries in the index corresponds to the order of records in the data file.

The second index is a dense, non-clustered index on the age field, with one data entry in the index per record in the data file.

Note:- Notice that the order of data entries in the index differs from the order of the data records.

[Figure: sparse versus dense index. The sparse index on name holds one entry per page (Abby, Cass, Sam); the dense index on age holds one entry per record (22, 25, 30, 33, 40, 44, 44, 50).]
Note:- We cannot build a sparse index that is not clustered, so we can have at most one sparse index per file. A sparse index is much smaller than a dense index. Some very useful optimization techniques rely on dense indexes.

A data file is said to be inverted on a field if there is a dense secondary index on that field. A fully inverted file is one in which there is a dense secondary index on each field that does not appear in the primary key.
PRIMARY AND SECONDARY INDEXES
Primary index:- This is an index on a set of fields that includes the primary key, or an index that uses Alt (1).
Secondary index:- An index that does not contain the primary key, or that uses Alt (2) or (3), is called a secondary index. It is an index that is not a primary index and may contain duplicates. If the search key does not correspond to the primary key (of the relation), then multiple records can have the same search key value.
Two data entries are said to be duplicates if they have the same value for the search key field associated with the index. If there are no duplicates, i.e. the search key contains some candidate key, then the index is called a unique index.
INDEXES USING COMPOSITE SEARCH KEYS
The search key of an index can contain several fields. Such keys are called composite search keys or concatenated keys. For example, consider a collection of employee records with fields name, age and salary, stored sorted by name. The figure below illustrates the difference between a composite index with key <age, sal>, one with key <sal, age>, and single-field indexes with keys <age> and <sal>.
[Figure: composite key indexes on the data records (Bob, 12, 10), (Cal, 11, 80), (Joy, 12, 20), (Sue, 13, 75). The <age, sal> index holds 11,80 / 12,10 / 12,20 / 13,75; the <sal, age> index holds 10,12 / 20,12 / 75,13 / 80,11; the <age> index holds 11, 12, 12, 13 and the <sal> index holds 10, 20, 75, 80.]
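A small sketch of why the key order matters: composite entries sort lexicographically, so <age, sal> orders by age first, then by sal within equal ages, while <sal, age> orders by sal first:

rows = [("Bob", 12, 10), ("Cal", 11, 80), ("Joy", 12, 20), ("Sue", 13, 75)]

age_sal = sorted((age, sal) for _, age, sal in rows)   # <age, sal> entries
sal_age = sorted((sal, age) for _, age, sal in rows)   # <sal, age> entries

print(age_sal)   # [(11, 80), (12, 10), (12, 20), (13, 75)]
print(sal_age)   # [(10, 12), (20, 12), (75, 13), (80, 11)]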
Database security
Definition:- Database security is the use of a broad range of information security controls to protect databases (potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links) against compromises of their confidentiality, integrity and availability.
Database security involves various types or categories of controls, such as technical, procedural/administrative and physical controls. Its main concerns are to ensure the secrecy, integrity and availability of data in a database system.

Criteria for designing secure database applications/systems


There are 3 main criteria to be considered when designing a secure database:
 Secrecy
 Integrity
 Availability
Secrecy:- This implies the protection of information from unauthorized disclosure, either by direct retrieval or by indirect logical inference. It also covers the leakage of information by an authorized user who passes secret information to unauthorized users, whether intentionally or without the knowledge of the authorized user.

Integrity:- This is the protection of data from malicious or accidental modification; it covers the insertion of false data, the contamination of data and the destruction of data. Integrity constraints are rules that define the correct state of a database and thus can protect the correctness of the database during operations.

Availability:- This is to ensure that data is available to authorized users when it is needed. Availability concerns include denial of service, that is, a system not functioning in accordance with its intended purpose. Availability is closely related to integrity, because denial of service may be caused by unauthorized destruction, modification or delay of service as well.

Types of database security controls


The following are some types of security controls on a database:
o Access control
o Auditing control
o Authentication
o Encryption
o Backups
Access control:- This is a method of granting access to computer users through authorization and authentication; the combined use of authentication and authorization is what is called access control.

Definition:- Access control is the process of controlling access to a computer system by insisting on an authentication procedure, to establish with some degree of confidence the identity of the user, and then granting the privileges established for that identity.
An access control system provides the following essential services:
 Authorization:- This specifies what a subject can do.
 Identification and authentication:- This ensures that only legitimate subjects can log on to a system.
 Access approval:- This grants access during operation, by associating users with the resources that they are allowed to access based on the authorization policy.
 Accountability:- This identifies what a subject (or all subjects associated with a user) did.

Access control models
A database for an enterprise contains a great deal of information and usually has several groups of users. Most users need access to only a small part of the database to carry out their tasks. Allowing users unrestricted access to all the data can be undesirable, so a DBMS should provide mechanisms to control access to data.
Access control models are sometimes categorised as either:
Discretionary
Non-discretionary
The two most widely recognized models are:
 Discretionary access control (DAC)
 Mandatory access control (MAC)
Discretionary Access Control (DAC):- In this model users have privileges to access or modify objects in the database; if they have permission, users can grant privileges to other users, and the DBMS keeps track of who has what rights.
There are two key concepts in DAC, illustrated in the sketch below:
File and data ownership:- Every object in the system has an owner. In most DAC systems, each object's initial owner is the subject that caused it to be created. The access policy for an object is determined by its owner.
Access rights and permissions:- These are the controls that an owner can assign to other subjects for specific resources.
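A minimal DAC sketch (a toy model, not a real DBMS API; the object and user names are invented): each object has an owner, and only the owner may grant privileges to other users:

owners = {"employees": "alice"}                     # file and data ownership
grants = {("employees", "bob"): {"SELECT"}}         # access rights and permissions

def grant(obj, grantor, grantee, privilege):
    if owners.get(obj) == grantor:                  # only the owner may grant
        grants.setdefault((obj, grantee), set()).add(privilege)

def allowed(obj, user, privilege):
    return owners.get(obj) == user or privilege in grants.get((obj, user), set())

grant("employees", "alice", "carol", "SELECT")
print(allowed("employees", "carol", "SELECT"))      # True
print(allowed("employees", "carol", "UPDATE"))      # False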
Mandatory Access Control (MAC):- This refers to allowing access to a resource if and only if rules exist that allow the given user to access the resource. It is difficult to manage, but its use is usually justified for protecting highly sensitive information, for example certain government and military information. The method is mandatory because of the use of either rules or sensitivity labels.
Sensitivity labels:- In such a system, subjects and objects must have labels assigned to them. A subject's sensitivity label specifies its level of trust; an object's sensitivity label specifies the level of trust required for access. In order to access a given object, the subject must have a sensitivity level equal to or higher than that required by the object, as in the sketch below.
Data import and export:- Controlling the import and export of data to other systems (including printers) is a critical function of these systems, which must ensure that sensitivity labels are properly maintained and implemented so that sensitive information is appropriately protected at all times.
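A minimal sketch of the sensitivity-label rule (the label names and their ordering are an assumed example):

LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top-secret": 3}

def can_access(subject_label, object_label):
    # The subject's level must be equal to or higher than the object's.
    return LEVELS[subject_label] >= LEVELS[object_label]

print(can_access("secret", "confidential"))   # True
print(can_access("confidential", "secret"))   # False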

AUDIT CONTROL:- This involves observing a database so as to be aware of the actions of database users. Database administrators and consultants often set up auditing for security purposes, for example to ensure that those without permission to access information do not access it.

AUTHENTICATION:- This is the act of confirming the truth of an attribute of a datum or entity. It might involve confirming the identity of a person or software program, tracing the origin of an artefact, or ensuring that a product is what its packaging and labelling claim it to be. Authentication often involves verifying the validity of at least one form of identification.

ENCRYPTION:- Encryption is the process of taking data that is plain text (readable form) and using a mathematical technique to make the text unreadable. The receiver then performs a similar technique to decrypt the message. The process is shown below:

    plain text --[encryption, key k]--> cipher text --[decryption, key k]--> plain text

If data packets are encrypted, the information that is transmitted is bigger and more bandwidth is required. There is also more overhead on the devices performing encryption and decryption.

Types of keys in cryptography (encryption and decryption)


 Symmetric, single-key, secret-key or conventional encryption:- This is when both the sender and the receiver use the same key for the encryption and decryption of data.
 Asymmetric, two-key or public-key cryptography:- This is when the key used for encryption is different from the key used for decryption.

Note:- All encryption algorithms are based on two general principles: substitution and transposition. The fundamental requirements are that no information is lost and all operations are reversible.
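A toy sketch of these two principles (a Caesar-style substitution followed by a reversal as the transposition; deliberately insecure, purely to show that nothing is lost and every step is reversible). Because the same key encrypts and decrypts, this also illustrates symmetric encryption:

def encrypt(plaintext, key):
    substituted = "".join(chr((ord(c) + key) % 256) for c in plaintext)
    return substituted[::-1]                       # transposition: reverse the text

def decrypt(ciphertext, key):
    substituted = ciphertext[::-1]                 # undo the transposition
    return "".join(chr((ord(c) - key) % 256) for c in substituted)

message = "attack at dawn"
assert decrypt(encrypt(message, 3), 3) == message  # reversible, no information lost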

Benefits of cryptography (encryption and decryption)


Central to all security mechanisms
Confidentiality of data
Some protocols rely on encryption to ensure availability of resources.

Roles/functions of a database administrator in security


The following are the roles/functions of a database administrator:
 Development of schemas and subschemas.
 Physical and logical layout of data.
 Testing and maintaining the database.
 Development of data dictionary.
 Educating and training database users.
 Mediating between users and managers.
 Creating account for users.
 Mandatory control issues related to the database.

Data security:- This is the process of ensuring that data is kept safe from corruption and that access to it is suitably controlled. The goal of data security is to ensure privacy; it also helps in protecting personal data. Data security is part of the larger practice of information security: it is the practice of keeping data protected from corruption and unauthorized access.

SECURITY RISKS TO DATABASES


o Unauthorized or unintended activity or misuse by authorized database users, database administrators, system/network managers, or hackers, for example inappropriate access to sensitive data, metadata or functions within the database, or inappropriate changes to the database programs, structure or security configuration.
o Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data or programs, interruption or denial of authorized access to the database, attacks on other systems, and the unanticipated failure of database services.
o Overloads, performance constraints and capacity issues, resulting in the inability of authorized users to use the database as intended.
o Physical damage to database servers caused by computer-room fires or floods, overheating, lightning, accidental liquid spills, static discharge or electronic breakdown.
o Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation), data loss/corruption, performance degradation, etc.
o Data corruption and/or loss caused by the entry of invalid data or commands.
Questions
1. Explain briefly the importance of data security in a database.
2. What do you mean by data security?
3. What is risk assessment?
4. How do you identify the areas of vulnerability and develop strategies for securing data?
5. State and explain the three main objectives to be considered when designing a database.
6. State and explain the two types of access control.
7. Why is mandatory access control better than discretionary access control?
8. What do you mean by the term encryption?
9. What are the roles of a database administrator?

CRASH RECOVERY
Introduction:
The recovery manager is one of the hardest components of a DBMS to design and implement, because it must deal with a wide variety of database states and is called on during system failures. Stable storage is guaranteed (with very high probability) to survive crashes and media failures: a disk might get corrupted or fail, but stable storage is still expected to retain whatever is stored in it.
One way of achieving stable storage is to store the information on a set of disks rather than on a single disk, with some information duplicated, so that the information remains available even if one or two of the disks fail.
The recovery manager of a DBMS is responsible for ensuring two important properties of transactions:
 Atomicity
 Durability
 Atomicity: crash recovery ensures atomicity by undoing the actions of transactions that did not commit.
 Durability: crash recovery ensures durability by making sure that all actions of committed transactions survive system crashes (e.g. a core dump caused by a bus error) and media failures (e.g. a corrupted disk).

SYSTEM CRASH: a system crash happens when the system stops functioning in the normal way or stops altogether. In this case the recovery manager and other parts of the DBMS stop functioning (e.g. a core dump caused by a bus error).

MEDIA FAILURE: in this case the system is up and running, but a particular entity of the system is not functioning. The recovery manager is still functioning and can start recovering from the failure while the system is still running (e.g. a disk is corrupted).

Introduction to "ARIES"
ARIES:- Algorithm for Recovery and Isolation Exploiting Semantics. It is designed to work with a steal, no-force approach. When the recovery manager is invoked after a crash, restart proceeds in three phases (see the skeleton sketch below):
 Analysis
 Redo
 Undo
Analysis:- This phase identifies the dirty pages in the buffer pool (i.e. pages with changes that have not been written to disk) and the transactions that were active at the time of the crash.
Redo:- This phase repeats all actions, starting from an appropriate point in the log, and restores the database state to what it was at the time of the crash.
Undo:- This phase undoes the actions of transactions that did not commit, so that the database reflects only the actions of committed transactions.
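The following skeleton sketches the three phases, under the simplifying assumption that the log is a list of (lsn, transID, type, pageID) tuples; the actual redo and undo actions are left as comments:

def restart(log):
    # Analysis: find dirty pages and the transactions active at the crash.
    dirty, active = {}, set()
    for lsn, tid, kind, page in log:
        if kind == "update":
            active.add(tid)
            dirty.setdefault(page, lsn)       # recLSN: first LSN dirtying the page
        elif kind in ("commit", "abort"):
            active.discard(tid)
    # Redo: repeat history from the smallest recLSN onward.
    start = min(dirty.values(), default=0)
    for lsn, tid, kind, page in log:
        if kind == "update" and lsn >= start:
            pass                               # reapply the logged change here
    # Undo: roll back losers, scanning the log backwards.
    for lsn, tid, kind, page in reversed(log):
        if kind == "update" and tid in active:
            pass                               # undo the change and write a CLR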

Principles of the ARIES recovery algorithm


There are three main principles behind the ARIES recovery algorithm:
 Write-ahead logging
 Repeating history during Redo
 Logging changes during Undo

Write-ahead logging:- Any change to a database object is first recorded in the log, and the log record must be written to stable storage before the change to the database object itself is written to disk.

Repeating history during Redo:- Upon restart after a crash, ARIES retraces all actions of the DBMS before the crash and brings the system back to the exact state it was in at the time of the crash. It then undoes the actions of transactions that were still active at the time of the crash (effectively aborting them).

Logging changes during Undo:- Changes made to the database while undoing a transaction are logged, in order to ensure that such an action is not repeated in the event of repeated failures causing restarts.
THE LOG
Log:- The log, also called the trail or journal, is the history of actions executed by the DBMS. Physically, the log is a file of records stored in stable storage, which is assumed to survive crashes. This durability can be achieved by maintaining two or more copies of the log on different disks kept in different locations, so that the chance of all copies of the log being simultaneously lost is negligibly small.
The most recent portion of the log, called the log tail, is kept in main memory and is periodically forced to stable storage. This way, log records and data records are written to disk at the same granularity (pages or sets of pages).
Every log record is given a unique ID called the log sequence number (LSN). For recovery purposes, every page in the database contains the LSN of the most recent log record that describes a change to this page; this LSN is called the pageLSN.
A log record is written for each of the following actions:
Updating a page: after modifying the page, an update type record is appended to the log tail. The pageLSN of the page is then set to the LSN of the update log record. (The page must be pinned in the buffer pool while these actions are carried out.)
Commit: when a transaction decides to commit, it force-writes a commit type log record containing the transaction id. That is, the record is appended to the log, and the log tail is written to stable storage, up to and including the commit record. The transaction is considered to have committed at the instant that its commit log record is written to stable storage. (Some additional steps must be taken afterwards, e.g. removing the transaction's entry from the transaction table.)
Abort: when a transaction is aborted, an abort type log record containing the transaction id is appended to the log, and undo is initiated for this transaction.
End: as noted above, when a transaction is aborted or committed, some additional actions must be taken beyond writing the abort or commit log record. After all these additional steps are completed, an end type log record containing the transaction id is appended to the log.
Undoing an update: when a transaction is rolled back (because the transaction is aborted, or during recovery from a crash), its updates are undone. When the action described by an update log record is undone, a compensation log record (CLR) is written.

Note: Every log record has certain fields: prevLSN, transID and type. The set of all log records for a given transaction is maintained as a linked list going back in time, using the prevLSN field; this list must be updated whenever a log record is added. The transID field is the id of the transaction generating the log record, and the type field indicates the type of the log record.
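A minimal sketch of these fields (the field names follow the text; the records themselves are illustrative):

from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int
    prevLSN: int | None     # previous log record of the same transaction, or None
    transID: str            # transaction generating this record
    type: str               # e.g. "update", "commit", "abort", "end", "CLR"

# T1000's records form a linked list back in time through prevLSN: 4 -> 1.
log = [LogRecord(1, None, "T1000", "update"),
       LogRecord(2, None, "T2000", "update"),
       LogRecord(3, 2,    "T2000", "update"),
       LogRecord(4, 1,    "T1000", "update")]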

OTHER RECOVERY-RELATED DATA STRUCTURES


In addition to the log, the following two tables contain important recovery-related information:
Transaction table:- This table contains one entry for each active transaction. The entry contains (among other things) the transaction id, the status, and a field called lastLSN, which is the LSN of the most recent log record for this transaction. The status of a transaction can be in progress, committed, or aborted (in the latter two cases, the transaction will be removed from the table once certain clean-up steps are completed).

Dirty page table:- This table contains one entry for each dirty page in the buffer pool, that is, each page with changes that are not yet reflected on disk. The entry contains a field recLSN, which is the LSN of the first log record that caused the page to become dirty. Note that this LSN identifies the earliest log record that might have to be redone for this page during restart from a crash.

Note:- During normal operation, these tables are maintained by the transaction manager and the buffer
manager, respectively, and during restart after a crash, these tables are reconstructed in the Analysis phase of
restart.

Let us use this example to illustrate the scenario:

Transaction T1000 changes the value of bytes 21 to 23 on page P500 from ABC to DEF.
Transaction T2000 changes HIJ to KLM on page P600.
Transaction T2000 changes bytes 20 to 22 from GDE to QRS on page P500.
Transaction T1000 changes TUV to WXY on page P505.

The log then contains the following update records (transID, type, pageID, length, offset, before-image, after-image):

T1000 | update | P500 | 3 | 21 | ABC | DEF
T2000 | update | P600 | 3 | 41 | HIJ | KLM
T2000 | update | P500 | 3 | 20 | GDE | QRS
T1000 | update | P505 | 3 | 21 | TUV | WXY

If the log is as shown above, construct,


1). The dirty page table from the log.
2). The transaction table from the log.

Solutions (assigning the four log records LSNs 1 to 4, in order):

Dirty page table:

pageID | recLSN
P500   | 1
P600   | 2
P505   | 4

Transaction table:

transID | lastLSN
T1000   | 4
T2000   | 3
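The two tables can be computed mechanically from the log; a sketch, using the same assumed LSNs 1 to 4:

log = [(1, "T1000", "P500"), (2, "T2000", "P600"),
       (3, "T2000", "P500"), (4, "T1000", "P505")]

dirty_page_table, transaction_table = {}, {}
for lsn, tid, page in log:
    dirty_page_table.setdefault(page, lsn)   # recLSN: first record to dirty the page
    transaction_table[tid] = lsn             # lastLSN: most recent record of tid

print(dirty_page_table)    # {'P500': 1, 'P600': 2, 'P505': 4}
print(transaction_table)   # {'T1000': 4, 'T2000': 3}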

THE WRITE-AHEAD LOG PROTOCOL


Before writing a page to disk, every update log record that describes a change to this page must be forced to stable storage. This is accomplished by forcing all log records up to and including the one with LSN equal to the pageLSN to stable storage before writing the page to disk, as in the sketch below.
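A minimal sketch of the rule (flushed_lsn stands for the largest LSN already on stable storage; the function names are illustrative):

flushed_lsn = 0

def flush_log_up_to(lsn):
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)      # force the log tail to stable storage

def write_page(page_id, page_lsn):
    if page_lsn > flushed_lsn:               # WAL: the log goes first,
        flush_log_up_to(page_lsn)            # up to and including pageLSN
    # ... only now may the page itself be written to disk ...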
The importance of WAL is that it ensures that a record of every change to the database is available while attempting to recover from a crash. If a transaction made a change and committed, the no-force approach means that some of these changes may not have been written to disk at the time of a subsequent crash; without a record of these changes, there would be no way to ensure that the changes of a committed transaction survive crashes.

Note:- A committed transaction is a transaction whose log records, including a commit record, have all been written to stable storage.
When a transaction is committed, the log tail is forced to stable storage, even if a no-force approach is being used. In contrast, under a force approach, all the pages modified by the transaction, rather than a portion of the log that includes all its records, would have to be forced to disk when the transaction commits. The set of all changed pages is typically much larger than the log tail, because the size of an update log record is close to twice the size of the changed bytes, which is likely to be much smaller than the page size. Moreover, the log is maintained as a sequential file, so all writes to the log are sequential writes. Therefore, the cost of forcing the log tail is much smaller than the cost of writing all changed pages to disk.

CHECKPOINTING
A checkpoint is like a snapshot of the DBMS state; by taking checkpoints periodically, the DBMS can reduce the amount of work to be done during restart in the event of a subsequent crash.

Checkpointing in ARIES has three steps, sketched below:


 Begin checkpoint.
 End checkpoint.
 Fuzzy checkpoint.
Begin checkpoint:- a begin_checkpoint record is written to indicate when the checkpoint starts.
End checkpoint:- an end_checkpoint record is constructed, including in it the current contents of the transaction table and the dirty page table, and appended to the log.
Fuzzy checkpoint:- this third step is carried out after the end_checkpoint record is written to stable storage: a special master record containing the LSN of the begin_checkpoint log record is written to a known place on stable storage. While the end_checkpoint record is being constructed, the DBMS continues executing transactions and writing other log records, so the only guarantee we have is that the transaction table and dirty page table are accurate as of the time of the begin_checkpoint record.
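A minimal sketch of the three steps (the log is modelled as a list, and an LSN as a list index, purely for illustration):

def take_checkpoint(log, transaction_table, dirty_page_table, master):
    log.append(("begin_checkpoint",))                       # step 1
    log.append(("end_checkpoint",                           # step 2: snapshot both
                dict(transaction_table), dict(dirty_page_table)))
    # Step 3: once the end_checkpoint record reaches stable storage, record
    # the LSN (here: list index) of the begin_checkpoint record in the master.
    master["begin_checkpoint_lsn"] = len(log) - 2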

Note:- Fuzzy checkpointing is inexpensive because, unlike some other forms of checkpointing, it does not require quiescing the system or writing out pages in the buffer pool. On the other hand, the effectiveness of this checkpointing technique is limited by the earliest recLSN of the pages in the dirty page table, because during restart we must redo changes starting from the log record whose LSN is equal to this recLSN. Having a background process that periodically writes dirty pages to disk helps to limit this problem.
When the system comes back up after a crash, the restart process begins by locating the most recent checkpoint record. For uniformity, the system always begins normal execution by taking a checkpoint in which the transaction table and dirty page table are both empty.

MEDIA RECOVERY
This is the periodic creation of a copy of the database. Because copying a large database object, such as a file, can take a long time, and the DBMS must be allowed to continue with its operations in the meantime, creating a copy is handled in a manner similar to taking a fuzzy checkpoint.
When a database object such as a file or a page is corrupted, the copy of that object is brought up to date by using the log to identify and reapply the changes of committed transactions, and to undo the changes of uncommitted transactions, as of the time of the media recovery operation.
The begin_checkpoint LSN of the most recent complete checkpoint is recorded along with the copy of the database object. To minimize the work in reapplying changes, we compare the smallest recLSN of a dirty page in the corresponding end_checkpoint record with the LSN of the begin_checkpoint record, and call the smaller of these two LSNs I. We observe that the actions recorded in all log records with LSNs less than I must already be reflected in the copy; thus, only log records with LSNs greater than I need to be reapplied to the copy (see the small worked sketch below).
Finally, the updates of transactions that are incomplete at the time of media recovery, or that were aborted after the fuzzy copy was completed, need to be undone to ensure that the pages reflect only the actions of committed transactions. The set of such transactions can be identified as in the Analysis phase, and we omit the details.
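A small worked sketch of the LSN comparison above, reusing the dirty page table from the earlier example and an assumed begin_checkpoint LSN:

smallest_recLSN = min({"P500": 1, "P600": 2, "P505": 4}.values())   # = 1
begin_checkpoint_LSN = 3                                            # assumed
I = min(smallest_recLSN, begin_checkpoint_LSN)
print(I)   # 1 -> only log records with LSNs greater than 1 are reapplied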

DATABASE MANAGEMENT SYSTEMS


CENTRALIZED DATABASE MANAGEMENT SYSTEM (CDMS)
In a centralized database management system, all data is maintained at a single site, and the processing of individual transactions is assumed to be sequential.
The new trends in databases are the use of parallel evaluation techniques and of distribution. There are four distinct motivations behind these trends:
 Performance
 Increased availability
 Distributed access to data
 Analysis of distributed data

Performance: using several resources, for example CPUs and disks, in parallel can significantly improve performance.
Increased availability: if a site containing a relation goes down, the relation continues to be available if a copy is maintained at another site.
Distributed access to data: an organisation may have branches in several cities, and analysts may need to access data corresponding to different sites. There is usually locality in the access pattern; for example, a bank manager is likely to look up the accounts of customers at the local branch, and this locality can be exploited by distributing the data accordingly.
Analysis of distributed data: organisations increasingly want to examine all the data available to them, even when it is stored across multiple sites and on multiple database systems. Support for such integrated access involves many issues; even enabling access to widely distributed data can be a challenge.

PARALLEL DATABASE MANAGEMENT SYSTEM


A parallel database is one developed to improve performance through the parallelization of various operations, such as building indexes, loading data and evaluating queries. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations. Parallel databases improve processing and input/output speed by using multiple CPUs and disks in parallel: many operations are performed simultaneously, as opposed to serial processing, in which the computation steps are performed sequentially.

[Figure: a parallel database — several workstations (Workstation 1, 2, 3) operating on one database in parallel.]
Architectures for parallel databases
There are three main architectures for building parallel databases:
 Shared-memory system.
 Shared-disk system.
 Shared-nothing system.
Shared-memory system:- In this architecture, multiple CPUs are attached to an interconnection network and can access a common region of main memory.
[Figure: shared-memory architecture — CPUs and disks on an interconnection network, with a global shared memory.]

Shared-disk system:- In this architecture, each CPU has a private memory and direct access to all disks through an interconnection network.

[Figure: shared-disk architecture — each CPU has its own memory; all CPUs reach every disk through the interconnection network.]

Shared-nothing system:- In this type of system, each CPU has its own local main memory and disk space, but no two CPUs can access the same storage area; all communication between CPUs is through the network connection.

[Figure: shared-nothing architecture — each CPU has its own memory and its own disks, communicating with the others only over the network.]
Note: - The basic problem with the shared-disk and shared-memory architectures is interference: as more CPUs are added, existing CPUs are slowed down because of the increased contention for memory accesses and network bandwidth. The shared-nothing architecture is now considered the best architecture for large parallel database systems, because it has been shown to provide linear speed-up (the time for operations decreases in proportion to the increase in the number of CPUs and disks) and linear scale-up (performance is sustained if the number of CPUs and disks is increased in proportion to the amount of data), although it requires a more extensive reorganization of the DBMS code.

Advantages of parallel database management systems


 Higher performance.
 Higher availability.
 Greater flexibility.
 More users.
 Distributed access to data.
 Analysis of distributed data.

Higher performance:- With more CPUs available to an application, higher speed-up and scale-up can be achieved; using several resources, e.g. CPUs and disks, in parallel can significantly improve performance.
Higher availability:- If a site containing a relation goes down, the relation continues to be available if a copy is maintained at another site.
Greater flexibility:- A parallel environment such as OPS (Oracle Parallel Server) is extremely flexible; you can allocate or de-allocate resources as necessary. For example, as database demand increases you can temporarily allocate more instances, then de-allocate the resources and use them for other purposes once they are no longer required.
More users:- Parallel database technology can make it possible to overcome memory limits, enabling a single system to serve thousands of users.
Distributed access to data:- An organization may have branches at several sites, allowing analysts to access data corresponding to different sites; there is usually locality in the access pattern. For example, a bank manager is likely to look up the accounts of customers at a local branch, and this locality can be exploited by distributing the data accordingly.
Analysis of distributed data:- Organizations increasingly want to examine all the data available to them, even when it is stored across multiple sites and on multiple database systems.

Disadvantages of parallel databases


o It is expensive to implement.
o Managing parallel databases is difficult and complex.
o A huge amount of resources is required to support parallelism.

DISTRIBUTED DATABASE MANAGEMENT SYSTEM


In this type of database, data is physically stored across several sites, and each site is typically managed by a DBMS that is capable of running independently of the other sites.
The location of data items and the degree of autonomy of individual sites have a significant impact on all aspects of the system, including query optimization and processing, concurrency control and recovery. In contrast to parallel databases, the distribution of data is governed by factors such as local ownership and increased availability, in addition to performance issues.
A collection of data, e.g. a database, can be distributed across multiple physical locations. A distributed database can reside on network servers on the internet, on an intranet or extranet, or on other company networks. The replication and distribution of databases improves database performance at the end-user's workstation.

[Figure: a distributed database management system — PCs in the Account, Personnel and Marketing departments, each group with its own database (DB), connected in a ring.]

The classical view of a distributed database system is that the system should make the impact of data distribution transparent. This is achieved in the following ways:
Distributed data independence: - users should be able to ask queries without specifying where the referenced relations, or copies or fragments of the relations, are located. This principle is a natural extension of physical and logical data independence. Also, queries that span multiple sites should be optimized based on communication costs.
Distributed transaction atomicity: - users should be able to write transactions that access data at several sites just as they would write transactions over purely local data. In particular, the effects of a transaction across sites should remain atomic: all changes persist if the transaction commits, and none persist if it aborts.

Types of distributed databases


There are two major types of distributed database management systems:
Homogeneous distributed databases
Heterogeneous distributed databases.

Homogeneous distributed database:- All sites have identical software, are aware of each other, and agree to cooperate in processing user requests. Each site surrenders part of its autonomy in terms of the right to change schemas or software. A homogeneous DBMS appears to the user as a single system, and it is easier to design and manage.
Conditions for a homogeneous database:
The operating system used at each location must be the same or compatible.
The data structures used at each location must be the same or compatible.
The database application or DBMS used at each location must be the same or compatible.
Heterogeneous distributed database:- Different sites may use different schemas and software. In a heterogeneous system, different nodes may have different hardware, software and data structures, which may not be compatible across the various nodes or locations. Different computers, operating systems, database applications or data models may be used at each of the locations. For example, one location may have the latest relational database management technology, while another location may store data using conventional files or an old database management system; one location may run the Windows NT operating system while another runs UNIX.

Architectures for distributed database systems


The following are the architectures for distributed database systems:
 Client-server.
 Collaborating server.
 Middleware.

Client-server system:- This type of system has one or more client processes and one or more server processes, and a client process can send a query to any one server process. Clients are responsible for user-interface issues, while servers manage data and execute transactions. A client process could run on a personal computer and send queries to a server running on a mainframe.
The client-server architecture is popular for the following reasons:
 It is relatively simple to implement, due to its clear separation of functionality and because the server is centralized.
 Expensive server machines are not underutilized by dealing with mundane user interactions, which are now relegated to inexpensive client machines.
 Users can run a graphical user interface that is familiar to them, rather than the (possibly unfamiliar) user interface on the server.
Collaborating server system:- This is a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers. This is in contrast to client-server systems, where a single query cannot span multiple servers, because the client process would then have to be capable of breaking such a query into appropriate sub-queries to be executed at various sites and then piecing together the answers to the sub-queries.
Middleware system:- This is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multisite execution strategies. With this system we need just one database server that is capable of managing queries and transactions spanning multiple servers, with the remaining servers handling only local queries and transactions. This special server is a layer of software that coordinates the execution of queries and transactions across one or more independent database servers.

Advantages of distributed database management systems


Management of distributed data
Increased reliability and availability
Easier expansion
Reflects organizational structure
Local autonomy or site autonomy
Protection of valuable data
Improved performance
Economies
Modularity
Reliable transactions
Continuous operation
Distributed query processing
Distributed transaction management

 Management of distributed data:- This is done with different levels of transparency, such as network transparency, fragmentation transparency and replication transparency.
 Increased reliability and availability:- The data in the database can be distributed across multiple physical locations, so the failure of one site does not make all the data unavailable.
 Easier expansion:- New systems and devices can be added without much problem.
 Reflects organizational structure:- This is possible through database fragmentation, because database fragments are located in the departments they relate to.
 Local autonomy or site autonomy:- A department can control the data about itself, because it is most familiar with that data.
 Protection of valuable data:- If there is a catastrophic event such as a fire, the data is not all in one place; it is distributed across multiple locations.
 Improved performance:- Data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing the load on the database to be balanced among servers.
 Economies:- It costs less to create a network of smaller computers with the combined power of a single large computer.
 Modularity:- Systems can be modified, added to, or removed from the distributed database without affecting the other modules.
 Reliable transactions:- This is due to the replication of the database.
 Continuous operation:- Even if some nodes go offline, the others will continue to function, depending on the design.
 Distributed query processing:- This improves performance.
 Distributed transaction management.

Disadvantages of distributed databases


 Complexity
 Economies
 Security
 Difficulty in maintaining integrity
 Inexperience
 Lack of standards
 Complexity in database design

Complexity:- The DBAs must do extra work to ensure that the distributed nature of the system is transparent, and work must also be done to maintain multiple disparate systems instead of one big system.
Economies:- The increased complexity and more extensive infrastructure mean extra labour costs.
Security:- Remote database fragments must be secured, and since they are not centralized, the remote sites must be secured as well. The infrastructure must also be secured (e.g. by encrypting the network links between remote sites).
Difficulty in maintaining integrity:- In a distributed database, enforcing integrity over a network may require too much of the network's resources to be feasible.
Inexperience:- Distributed databases are difficult to work with, and as a young field there is not much readily available experience on proper practice.
Lack of standards:- There are as yet no tools or methodologies to help users convert a centralized DBMS into a distributed DBMS.
Complexity in database design:- The design has to consider the fragmentation of data, the allocation of fragments to specific sites, and data replication.

STORING DATA IN A DDBMS


In a distributed DBMS, relations are stored across several sites. Accessing a relation that is stored at a remote site incurs message-passing costs; to reduce this overhead, a single relation may be partitioned or fragmented across several sites, with fragments stored at the sites where they are most often accessed, or replicated at each site where the relation is in high demand.

Data storage in a DDBMS involves 2 concepts:


 Fragmentation
 Replication.

FRAGMENTATION
This is the process of breaking a relation into smaller relations or fragments and storing the fragments (instead of the relation itself), possibly at different sites.

Types of fragmentation:
Horizontal fragmentation
Vertical fragmentation

Horizontal fragmentation:- Each fragment consists of a subset of the rows of the original relation. The union of the horizontal fragments must be equal to the original relation, and the fragments are also required to be disjoint.
Vertical fragmentation:- Each fragment consists of a subset of the columns of the original relation. To ensure that a vertical fragmentation is lossless-join, systems often assign a unique tuple-id to each tuple in the original relation; this id is attached to the projection of the tuple in each fragment, and this extra tuple-id field, added to each vertical fragment, guarantees that the decomposition is lossless-join.

Note:- When a relation is fragmented, we must be able to recover the original relation from the fragments. A sketch follows after the example table.

Tid | Sid  | Name  | City   | Age | Sal
11  | 0251 | Ola   | Lagos  | 18  | 35
12  | 0361 | Ade   | Ibadan | 18  | 32
13  | 0282 | Tayo  | Ido    | 19  | 48
14  | 0355 | Oke   | Ondo   | 11  | 20
15  | 0282 | Osho  | Ado    | 14  | 39
16  | 0289 | Orire | Oyo    | 18  | 42

(A horizontal fragment is a subset of these rows; a vertical fragment is a subset of the columns, carried along with the Tid.)
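A minimal sketch of the two kinds of fragmentation on rows like those above (the predicate and the column split are arbitrary examples):

rows = [{"tid": 11, "name": "Ola", "city": "Lagos", "age": 18, "sal": 35},
        {"tid": 12, "name": "Ade", "city": "Ibadan", "age": 18, "sal": 32}]

# Horizontal fragment: a subset of the rows.
horiz = [r for r in rows if r["city"] == "Lagos"]

# Vertical fragments: subsets of the columns, each carrying the tuple-id.
vert1 = [{"tid": r["tid"], "name": r["name"], "city": r["city"]} for r in rows]
vert2 = [{"tid": r["tid"], "age": r["age"], "sal": r["sal"]} for r in rows]

# The tuple-id makes the decomposition lossless-join: joining on tid
# reconstructs the original relation.
rebuilt = [{**a, **b} for a in vert1 for b in vert2 if a["tid"] == b["tid"]]
assert rebuilt == rows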
REPLICATION
This is the storing of copies of a relation or fragment. An entire relation can be replicated at one or more sites, and similarly, one or more fragments of a relation can be replicated at other sites. For example, if a relation R is fragmented into R1, R2 and R3, there might be just one copy of R1, whereas R2 is replicated at two other sites and R3 is replicated at all sites.

Advantages of replication
Increased availability:- If a site containing a replica goes down, we can find the same data at other sites. Also, if local copies of remote relations are available, we are less vulnerable to failures of communication links.
Faster query evaluation:- Queries can execute faster by using a local copy of a relation instead of going to a remote site.
Questions
1. Explain distributed and parallel database systems.
2. State the importance of distributed and parallel database systems.
3. How is data stored in a distributed DBMS?
4. Distinguish between the following pairs of terms:
a). Horizontal fragmentation and vertical fragmentation.
b). Client-server and collaborating server architectures.
c). Homogeneous and heterogeneous distributed databases.
