Adb CH 4


CHAPTER FOUR

4. Distributed Database Systems


Learning Objectives: This chapter deals with distributed database management systems. After
completing this chapter, the learner should be familiar with the following concepts:
• Distributed database management system
• Distributed Query Processing
• Distributed Transaction Management
• Distributed Concurrency Control

4.1. Concepts of Distributed Database Management Systems


Database development facilitates the integration of data from many applications in an organization
and enforces security on data access at a single local site. However, organizational data does not
always reside at one central site. Databases at different sites then need to be integrated and
synchronized while retaining all the facilities of the database approach. Computer networks and data
communication, supported by the internet, mobile and wireless computing, and intelligent devices,
make this possible and lead to distributed database systems. A distributed database (DDB) is a
logically interrelated collection of shared data that is physically distributed over the computers of
a computer network. A distributed database management system (DDBMS) is the software system that
manages the distributed database and provides an access mechanism that makes the distribution
transparent to users.

Figure: Centralized Database

Figure: Distributed Database System
• What makes a DDBMS different is that:
▪ The various sites are aware of each other.
▪ Each site provides a facility for executing both local and global transactions.
• The different sites can be connected physically in different topologies:
▪ Fully connected network
▪ Partially connected network
▪ Tree network
▪ Star network
▪ Ring network

• These topologies differ in:
▪ Installation Cost: cost of linking sites physically.
▪ Communication Cost: cost to send and receive messages and data.
▪ Reliability: resistance to failure.
▪ Availability: degree to which data can be accessed despite failures.
• The database sites may be distributed over:
1. Large Geographical Area: Long-Haul Network
It is relatively slow and less reliable, and uses telephone lines, microwave links, or satellite
channels.
2. Small Geographical Area: Local Area Network
It offers higher speed and a lower error rate, and uses twisted pair, baseband coaxial, broadband
coaxial, or fiber-optic cable.
4.1.1. Functions of a DDBMS
A DDBMS provides the following functionality:
• Extended communication services: provide access to remote sites.
• Extended data dictionary: stores data distribution details, which requires a global system catalog.
• Distributed query processing: optimizes queries, including remote data access.
• Extended security: access control over distributed data.
• Extended concurrency control: maintains consistency of replicated data.
• Extended recovery services: handle failures of individual sites and of the communication lines.

4.2. Distributed Database Design


Designing a distributed database for a multiple-site system is difficult. From the technical
viewpoint, new problems arise, such as the interconnection of sites by a computer network and the
optimal distribution of data and applications to the sites, both to meet the requirements of
applications and to optimize performance. From the organizational viewpoint, the issue of
decentralization is crucial: distributed systems typically substitute for large, centralized systems,
and in this case distributing an application has a major impact on the organization. The following
set of interrelated questions covers the distributed database design issues:
1. Why fragment at all?
2. How should we fragment?
3. How much should we fragment?
4. Is there any way to test the correctness of decomposition?
5. How should we allocate?
6. What is the necessary information for fragmentation and allocation?
The design of a distributed database involves the following tasks:
• Designing the conceptual schema, which describes the integrated database, i.e., all the data
used by the database applications. Integration ≠ centralization.
• Designing the physical database, i.e., mapping the conceptual schema to storage areas and
determining appropriate access methods.
• Designing the fragmentation, i.e., determining how global relations are subdivided into
horizontal, vertical, or mixed fragments.
• Designing the allocation of fragments, i.e., determining how fragments are mapped to physical
images; this also determines the replication of fragments.
In a DDBMS, placement covers two things: placement of the DDBMS software and placement of the
applications that run on the DB.
4.2.1. Distributed Database Design Strategies
• Top-down: Based on designing systems from scratch

o Begins with the requirement analysis that defines the environment of the system and elicits
both the data and processing needs of all potential database users
o It is applicable for the design of homogeneous databases
• Bottom-up: When the databases already exist at a number of sites
o Design involves integrating databases into one database
o Integrate Local schema into Global schema
o It is ideal in the context of heterogeneous databases
4.2.2. Data Distribution/Allocation Strategies
A distributed DB stores logically related shared data and metadata at several physically independent
sites connected via a network. Data allocation is the process of deciding where to store a
particular data item. There are three data allocation strategies:
1. Centralized: the entire DB is located at a single site, and other computers access it through
the network.
2. Partitioned: the DB is split into several disjoint parts (called partitions, segments or fragments)
that are stored at several sites.
3. Replicated: copies of one or more partitions are stored at several sites. Replication can be:
• Selective: combines fragmentation (for locality of reference), replication (for data that is
updated less often), and centralization, as appropriate for the data.
• Complete: a copy of the database is made available at each site. Snapshots are one method used here.
In a distributed database system, the database is logically a single database but is physically
fragmented across several computers. A distributed database system has the following components:
▪ Local DBMS
▪ Distributed DBMS
▪ Global System Catalog (GSC)
▪ Data communication (DC)
A distributed database system consists of a collection of sites, each of which maintains a local
database system (Local DBMS); each local DBMS also participates in at least one global
transaction, in which the different databases are integrated.
▪ Local Transaction: a transaction that accesses data only at the single site where it was initiated.
▪ Global Transaction: a transaction that accesses data at several sites.
4.2.3. Data Stored in DDBMS
How is data stored in DDBMS? There are several ways of storing a single relation in distributed
database systems.
4.2.3.1. Replication
• The system maintains multiple identical copies of the data.
• The copies are stored at different sites for faster retrieval and fault tolerance.
• Duplicate copies of the tables can be kept on each system (replicated). With this option, updates
to the tables become more involved because every copy must be kept consistent (unless the copies
are read-only).
• Advantages: availability and increased parallelism (for read-only access).
• Disadvantage: increased overhead of updates.
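The trade-off between cheap reads and expensive updates can be illustrated with a small sketch. It is only an illustration under simplifying assumptions: the class name `ReplicatedItem`, the site labels, and the sample values are hypothetical, and a real DDBMS would propagate updates over the network inside a transaction.

```python
class ReplicatedItem:
    """One logical data item stored as identical copies at several sites."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}
        self.writes = 0  # number of physical writes performed so far

    def read(self, site):
        # Any single copy can serve a read, so reads at different sites can run in parallel.
        return self.copies[site]

    def update(self, value):
        # Every copy has to be written to keep the replicas identical.
        for site in self.copies:
            self.copies[site] = value
            self.writes += 1


item = ReplicatedItem(["S1", "S2", "S3"], value="Arada")
item.update("Bole")  # one logical update, one physical write per copy
print(item.read("S2"), item.writes)
```

One logical update costs as many physical writes as there are copies, which is the update overhead listed above.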
4.2.3.2. Fragmentation
A relation is partitioned into several fragments stored at distinct sites. The partitioning can be
vertical, horizontal or both. A fragmentation of a relation R into fragments R1, R2, …, Rn is correct
if it fulfils the following:
a. Completeness: every data item of R must appear in at least one fragment.
b. Reconstruction: it must be possible to reconstruct R from its fragments.
c. Disjointness: a data item should be found in only one fragment, except in vertical
fragmentation, where the primary key is repeated in every fragment for reconstruction.
1. Horizontal Fragmentation
• Systems can share the responsibility of storing information from a single table with
individual systems storing groups of rows
• Performed by the Selection Operation
• The whole content of the relation is reconstructed using the UNION operation

2. Vertical Fragmentation
• Systems can share the responsibility of storing particular attributes of a table.
• Each fragment needs an attribute that identifies the tuple (the primary key value is repeated
in every fragment).
• Performed by the Projection Operation
• The whole content of the relation is reconstructed using the Natural JOIN operation on the
tuple-identifying attribute (the primary key values)

3. Both (Hybrid Fragmentation)


• A system can share the responsibility of storing particular attributes of a subset of records in
a given relation.
• Performed by projection then selection or selection then projection relational algebra operators.
• Reconstruction is made by combined effect of Union and natural join operators.
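As a minimal sketch of the fragmentation rules, the following Python fragment splits a small relation horizontally (selection, rebuilt by union) and vertically (projection, rebuilt by a join on the primary key), then checks reconstruction and disjointness. The sample rows, the attribute names, and the helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical EMP relation; eno is the primary key.
EMP = [
    {"eno": 1, "ename": "Abebe", "dept": "Sales"},
    {"eno": 2, "ename": "Sara", "dept": "IT"},
    {"eno": 3, "ename": "Kebede", "dept": "Sales"},
]


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attrs} for row in relation]


def natural_join(left, right, key):
    """Join two vertical fragments on their shared key attribute."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def same_rows(a, b):
    """Compare two relations while ignoring row order."""
    def canon(rel):
        return sorted(tuple(sorted(r.items())) for r in rel)
    return canon(a) == canon(b)


# Horizontal fragmentation: selection, reconstructed by union.
h1 = select(EMP, lambda r: r["dept"] == "Sales")
h2 = select(EMP, lambda r: r["dept"] != "Sales")
disjoint = not any(row in h2 for row in h1)
horizontal_ok = same_rows(h1 + h2, EMP)  # concatenation acts as union because h1, h2 are disjoint

# Vertical fragmentation: projection with the key repeated, reconstructed by join.
v1 = project(EMP, ["eno", "ename"])
v2 = project(EMP, ["eno", "dept"])
vertical_ok = same_rows(natural_join(v1, v2, "eno"), EMP)

print(disjoint, horizontal_ok, vertical_ok)
```

A hybrid fragmentation would apply `project` to the output of `select` (or the reverse) and rebuild the relation with the same two operations.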
4.3. Data Transparency
Data transparency is the degree to which a system user may remain unaware of the details of how and
where the data items are stored in a distributed system.
1. Distribution transparency: even though there are many systems, they appear to the user as a
single, logical entity.
2. Replication transparency: the copies of a data item stored at different sites appear to
developers and users as a single copy.
3. Fragmentation transparency: a table that is actually stored in parts across several sites
appears as a single table in a single location.
4. Location transparency: the user doesn’t need to know where a data item is physically located.

4.4. How does DDBMS work?


Distributed computing can be difficult to implement, particularly for replicated data that can be
updated from many systems. To operate, a distributed database system has to take care of:
• Distributed Query Processing
• Distributed Transaction Management
• Replication Data Management: when copies of data are kept on many machines, how often is a copy
updated after the data changes on another system, and who is in charge of propagating the
update?
• Distributed Database Recovery: if one machine goes down, how does that affect the others?
• Security: like any computer network, a distributed system needs a common way to validate
users entering from any computer in the network of servers.
• Common Data Dictionary: each schema has to be distinguishable from, and work in connection
with, schemas created on many systems.
4.4.1. Homogeneous and Heterogeneous Distributed Databases
1. Homogeneous Distributed Database
In a homogeneous distributed database, all sites store the database identically. The operating
system, the database management system (DBMS) and the data structures are the same at all sites,
which makes the system easy to manage. The sites are aware of each other and agree to cooperate in
processing user requests. Each site surrenders part of its autonomy in terms of its right to change
schemas or software. The system appears to the user as a single system.
2. Heterogeneous Distributed Database
In a heterogeneous distributed database, different sites may use different schemas and DBMS software,
and possibly different operating systems and even different data models. The difference in schemas
is a major problem for query processing, and the difference in software is a major problem for
transaction processing, so translations are required for the sites to communicate. A site may be
completely unaware of the other sites and may provide only limited facilities for cooperation in
transaction processing. Gateways may be needed to interface the sites with one another.
4.4.2. Date’s Twelve Rules for a DDBMS
This section lists Date’s twelve rules (objectives) for DDBMSs. The basis for these rules is that a
distributed DBMS should feel like a non-distributed DBMS to the user. These rules are akin to
Codd’s twelve rules for relational systems. The fundamental principles are:

1. Local autonomy: The sites in a distributed system should be autonomous. In this context,
autonomy means that local data is locally owned and managed, local operations remain purely
local, and all operations at a given site are controlled by that site.
2. No reliance on a central site: there should be no central servers for services such as transaction
management, deadlock detection, query optimization, and management of the global system
catalog.
3. Continuous operation: Ideally, there should never be a need for a planned system shutdown for
operations such as adding or removing a site from the system, or dynamically creating and
deleting fragments at one or more sites.
4. Location independence: Location independence is equivalent to location transparency. The user
should be able to access the database from any site. Furthermore, the user should be able to
access all data as if it were stored at the user’s site, no matter where it is physically stored.
5. Fragmentation independence: The user should be able to access the data, no matter how it is
fragmented.
6. Replication independence: The user should be unaware that data has been replicated. Thus, the
user should not be able to access a particular copy of a data item directly, nor should the user
have to specifically update all copies of a data item.
7. Distributed query processing: The system should be capable of processing queries that reference
data at more than one site.
8. Distributed transaction processing: The system should support the transaction as the unit of
recovery. The system should ensure that both global and local transactions conform to the ACID
rules for transactions, namely: atomicity, consistency, isolation, and durability.
9. Hardware independence: It should be possible to run the DDBMS on a variety of hardware
platforms.
10. Operating system independence: It should be possible to run the DDBMS on a variety of
operating systems.
11. Network independence: It should be possible to run the DDBMS on a variety of disparate
communication networks.
12. Database independence: It should be possible to have a DDBMS made up of different local
DBMSs, perhaps supporting different underlying data models. In other words, the system
should support heterogeneity.

4.5. Distributed Query Processing
The objective of distributed query processing is to transform a high-level query on a distributed
database into an efficient execution strategy expressed in a low-level language on the local databases.
• Minimize a cost function: I/O cost + CPU cost + communication cost
• These components might have different weights in different distributed environments:
▪ Wide area networks: communication cost will dominate because of low bandwidth, low speed, and
high protocol overhead.
▪ Local area networks: communication cost is not that dominant, so the total cost function should
be considered.

Query Processing Components:
• Query language: the language in which queries are written, here SQL.
• Query execution methodology: the steps that one goes through in executing high-level
(declarative) user queries.
• Query optimization: how do we determine the “best” execution plan?
Query processing problem:

Example: SELECT ENAME FROM EMP, ASG WHERE EMP.ENO=ASG.ENO AND DUR > 37;

Cost of Alternatives:
Assume
• Size(EMP) = 400 tuples, size(ASG) = 1000 tuples
• Tuple access cost = 1 unit; tuple transfer cost = 10 units
• The (10+10) factor in Strategy 1 corresponds to 20 qualifying ASG tuples, split 10 and 10
across two sites
Strategy 1
• Produce ASGi: (10+10) ∗ tuple access cost = 20
• Transfer ASGi to the sites of EMP: (10+10) ∗ tuple transfer cost = 200
• Produce EMPi: (10+10) ∗ tuple access cost ∗ 2 = 40
• Transfer EMPi to result site: (10+10) ∗ tuple transfer cost = 200
• Total cost = 460
Strategy 2
• Transfer EMP to site 5: 400 ∗ tuple transfer cost = 4,000
• Transfer ASG to site 5: 1000 ∗ tuple transfer cost = 10,000
• Produce ASGi: 1000 ∗ tuple access cost = 1,000
• Join EMPi and ASGi: 400 ∗ 20 ∗ tuple access cost = 8,000
• Total cost = 23,000
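The two totals can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are hypothetical, and the value 20 for the number of qualifying ASG tuples is an assumption read off the (10+10) terms in the steps above.

```python
TUPLE_ACCESS = 1      # cost units per tuple accessed
TUPLE_TRANSFER = 10   # cost units per tuple transferred
SIZE_EMP = 400        # tuples in EMP
SIZE_ASG = 1000       # tuples in ASG
QUALIFYING = 10 + 10  # assumed: qualifying ASG tuples, 10 at each of two sites


def strategy_1():
    """Select and join at the data sites, ship only the small intermediate results."""
    produce_asg = QUALIFYING * TUPLE_ACCESS
    ship_asg = QUALIFYING * TUPLE_TRANSFER
    produce_emp = QUALIFYING * TUPLE_ACCESS * 2
    ship_emp = QUALIFYING * TUPLE_TRANSFER
    return produce_asg + ship_asg + produce_emp + ship_emp


def strategy_2():
    """Ship both complete relations to the result site and do all the work there."""
    ship_emp = SIZE_EMP * TUPLE_TRANSFER
    ship_asg = SIZE_ASG * TUPLE_TRANSFER
    produce_asg = SIZE_ASG * TUPLE_ACCESS
    join = SIZE_EMP * QUALIFYING * TUPLE_ACCESS
    return ship_emp + ship_asg + produce_asg + join


print(strategy_1(), strategy_2())
```

Under these assumptions the second strategy is 50 times more expensive, mainly because it transfers and scans whole relations instead of the 20 qualifying tuples.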
4.6. Distributed Transaction Management and Recovery
A given query can be processed with different strategies, and choosing a good one improves the
performance of the system by minimizing processing time and cost. In addition to the cost estimates
used for a centralized database (disk access, relation size, etc.), distributed query processing has
to consider the following:
• Cost of data transmission over the network
• Gain from parallel processing of a single query
With replicated data allocation, parallel processing can be used to increase performance, but
updates have a great impact since all the sites containing the data item must be updated. With
fragmentation, updates work more like in a centralized database, but reconstructing the whole
relation requires accessing data from all sites containing part of the relation.
Suppose the distributed database has three sites (S1, S2, and S3), and two relations, EMPLOYEE and
DEPARTMENT, are located at S1 and S2 respectively without any fragmentation. A query is initiated
from S3 to retrieve each employee together with the department they work in: first name (15 bytes
long), last name (15 bytes long) and department name (10 bytes long), a total of 40 bytes per
result record.
For EMPLOYEE we have the following information:
1. 10,000 records
2. each record is 100 bytes long
For DEPARTMENT we have the following information:
1. 100 records
2. each record is 35 bytes long
There are three ways of executing this query:
1. Transfer DEPARTMENT and EMPLOYEE to S3 and perform the join there. This needs a transfer
of 10,000*100 + 100*35 = 1,003,500 bytes.
2. Transfer EMPLOYEE to S2 and perform the join there, which produces a result of 40*10,000 =
400,000 bytes; then transfer the result to S3. This needs 1,000,000 + 400,000 = 1,400,000 bytes
to be transferred.
3. Transfer DEPARTMENT to S1 and perform the join there, which produces a result of 40*10,000 =
400,000 bytes; then transfer the result to S3. This needs 3,500 + 400,000 = 403,500 bytes to be
transferred.
One can then select the strategy that minimizes the data transfer cost for this specific query,
here strategy 3. Other optimization steps may also be included to make the processing more
efficient, such as reducing the size of the relations with projection before they are transferred.
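The comparison of the three strategies can be reproduced with a short calculation. The sketch below uses hypothetical variable names and assumes, as the figures in the text do, that every employee appears exactly once in the join result.

```python
EMP_RECORDS, EMP_RECORD_BYTES = 10_000, 100
DEPT_RECORDS, DEPT_RECORD_BYTES = 100, 35
RESULT_RECORD_BYTES = 15 + 15 + 10  # first name + last name + department name

emp_bytes = EMP_RECORDS * EMP_RECORD_BYTES
dept_bytes = DEPT_RECORDS * DEPT_RECORD_BYTES
# Assumption: each employee works in exactly one department, so the join has 10,000 rows.
result_bytes = EMP_RECORDS * RESULT_RECORD_BYTES

transfer = {
    "1: ship EMPLOYEE and DEPARTMENT to S3": emp_bytes + dept_bytes,
    "2: ship EMPLOYEE to S2, result to S3": emp_bytes + result_bytes,
    "3: ship DEPARTMENT to S1, result to S3": dept_bytes + result_bytes,
}

best = min(transfer, key=transfer.get)  # strategy with the fewest bytes on the network
for name, size in transfer.items():
    print(f"{name}: {size:,} bytes")
print("cheapest:", best)
```

Shipping the small relation to the site of the large one keeps the 1,000,000-byte EMPLOYEE relation off the network, which is why the third strategy wins here.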
4.6.1. Transaction Management
A transaction is a logical unit of work consisting of one or more operations executed by a single
user. A transaction begins with the user's first executable statement and ends when it is
committed or rolled back. A distributed transaction is a transaction that includes one or more
statements that, individually or as a group, update data on two or more distinct nodes of a distributed
database.
Representation of a query in a distributed database:
SQL Statement                                  Object   Database   Domain
SELECT * FROM dept@sales.midroc.telecom.et;    dept     sales      midroc.telecom.et

There are two types of transaction in a DDBMS that access data at other sites:
1. Remote Transaction: contains only statements that access a single remote node. A remote
query statement is a query that selects information from one or more remote tables, all of which
reside at the same remote node or site. For example, the following query accesses data from the
dept table in the Addis schema of the remote sales database:
• SELECT * FROM Addis.dept@sales.midroc.telecom.et;
A remote update statement is an update that modifies data in one or more tables, all of which are
located at the same remote node. For example, the following statement updates the dept table in
the Addis schema of the remote sales database:
• UPDATE Addis.dept@sales.midroc.telecom.et SET loc = 'Arada' WHERE BranchNo = 5;
2. Distributed Transaction: contains statements that access more than one node.
A distributed query statement retrieves information from two or more nodes. If all statements of a
transaction reference only a single remote node, the transaction is remote, not distributed. A
database must guarantee that all statements in a transaction, distributed or non-distributed, either
commit or roll back as a unit. The effects of an ongoing transaction should be invisible to all other
transactions at all nodes; this should hold for transactions that include any type of operation,
including queries, updates, or remote procedure calls. For example, the following query accesses
data from the local database as well as the remote sales database:
• SELECT ename, dname FROM Dessie.emp DS, Addis.dept@sales.midroc.telecom.et AD
WHERE DS.deptno = AD.deptno;
(Employee data is stored in Dessie and sales data is stored in Addis; there is an employee
responsible for each sale.)
Different kinds of SQL statements:
• Remote query: SELECT client_nm FROM clients@accounts.motorola.com;
• Distributed query: SELECT project_name, student_nm FROM intership@accounts.motorola.com i,
student s WHERE s.stu_id = i.stu_id;
• Remote update: UPDATE intership@accounts.motorola.com SET stu_id = '242'
WHERE stu_id = '200';
• Distributed update:
UPDATE intership@accounts.motorola.com SET stu_id = '242' WHERE stu_id = '200';
UPDATE student SET stu_id = '242' WHERE stu_id = '200';
COMMIT;
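The distinction between local, remote, and distributed statements depends only on which nodes the referenced tables live on. A minimal sketch of that rule follows; the function names are hypothetical, and it assumes table references are written as in the examples above, with an optional `@database-link` suffix naming the remote node.

```python
LOCAL = "local"


def node_of(table_ref):
    """Return the database link after '@', or LOCAL when the reference has none."""
    _, sep, link = table_ref.partition("@")
    return link.strip() if sep else LOCAL


def classify(table_refs):
    """Classify a transaction by the set of nodes its table references touch."""
    nodes = {node_of(ref) for ref in table_refs}
    if not nodes:
        raise ValueError("a transaction must reference at least one table")
    if nodes == {LOCAL}:
        return "local"
    if len(nodes) == 1:
        return "remote"       # every statement goes to the same single remote node
    return "distributed"      # more than one node is involved


print(classify(["clients@accounts.motorola.com"]))
print(classify(["intership@accounts.motorola.com", "student"]))
```

The second call is classified as distributed because the local node counts as a node too: the query touches both the local student table and the remote intership table.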
4.6.2. Distributed Concurrency Control
The distributed concurrency control mechanism of a distributed DBMS ensures the consistency of the
database in a multiuser distributed environment. The techniques used in distributed database
systems are similar to those used for concurrency control in centralized database systems, with
additional implementation requirements or modifications. The main difference is the way the lock
manager is implemented and how it functions. There are different schemes for concurrency control
in a DDBS:
1. Non-Replicated Scheme
• No data is replicated in the system
• All sites will maintain a local lock manager (local lock and unlock)
• If site Si needs a lock on data at site Sj, it sends a message to the lock manager of site Sj,
and the locking is handled by site Sj
• All locking and unlocking is handled by the local lock manager of the site at which the
data object resides.
• It is simple to implement
• It needs three message transfers:
o To request a lock
o To notify grant of lock
o To request unlock
2. Single Coordinator Approach
• The system chooses a single lock manager that resides at one of the sites (Si)
• All lock and unlock requests are made at site Si, where the lock manager resides
• It is simple to implement
• It needs two message transfers:
o To request a lock
o To request unlock
• Deadlock handling is simple
• It can become a bottleneck since all requests are handled at one site
• It is vulnerable if the site with the lock manager fails
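The non-replicated scheme (scheme 1) can be sketched as one lock manager per site, with each request routed to the site that stores the data item. This is a simplified illustration, not a full protocol: the class names are hypothetical, only exclusive locks are modeled, a refused request is not queued, and messages are merely counted rather than sent over a network.

```python
class LockManager:
    """Exclusive-lock table for the data items stored at one site."""

    def __init__(self):
        self.owner = {}  # item -> id of the transaction holding the lock

    def lock(self, item, txn):
        # Grant if the item is free or already held by the same transaction.
        if self.owner.get(item, txn) != txn:
            return False
        self.owner[item] = txn
        return True

    def unlock(self, item, txn):
        if self.owner.get(item) == txn:
            del self.owner[item]


class SiteNetwork:
    """Routes lock traffic to the lock manager of the site holding each item."""

    def __init__(self, placement):
        self.placement = dict(placement)  # item -> site that stores it
        self.managers = {site: LockManager() for site in set(self.placement.values())}
        self.messages = 0

    def request_lock(self, item, txn):
        manager = self.managers[self.placement[item]]
        self.messages += 1  # message 1: lock request sent to the owning site
        granted = manager.lock(item, txn)
        self.messages += 1  # message 2: reply (grant or refusal)
        return granted

    def release_lock(self, item, txn):
        self.messages += 1  # message 3: unlock request
        self.managers[self.placement[item]].unlock(item, txn)


net = SiteNetwork({"EMPLOYEE": "S1", "DEPARTMENT": "S2"})
granted = net.request_lock("EMPLOYEE", "T1")
refused = net.request_lock("EMPLOYEE", "T2")  # T1 still holds the lock
net.release_lock("EMPLOYEE", "T1")
print(granted, refused, net.messages)
```

A successful lock followed by its release costs three messages, matching the count given for the non-replicated scheme. Replacing the per-site managers with a single `LockManager` at one site would turn this into the single coordinator approach, with the bottleneck and failure risk listed above.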

4.7. Advantages and Disadvantages of DDBMS


4.7.1. Advantages of DDBMS
• Transparent management of distributed, fragmented, and replicated data

• Improved performance
• Easier and more economical system expansion
• Many existing systems: there may be many existing systems of different kinds (Oracle,
Informix, others) that need to be used together.
• Data sharing and distributed control:
o User at one site may be able to access data that is available at another site.
o Each site can retain some degree of control over local data
o We will have local as well as global database administrator
• Reliability and availability of data
o Improved reliability/availability through distributed transactions
o If one site fails, the rest can continue operation as long as a transaction does not
demand data from the failed site, or the data it needs is replicated at other sites
• Speedup of query processing
o If a query involves data from several sites, it may be possible to split the query
into sub-queries that can be executed at several sites which is parallel processing
o Query can be sent to the least heavily loaded sites
• Expansion (Scalability): a distributed environment can easily be expanded by adding more
machines to the network.
4.7.2. Disadvantages of DDBMS
• Software Development Cost: a DDBMS is difficult to install and is therefore costly
• Greater Potential for Bugs: parallel processing may endanger the correctness of algorithms
• Increased Processing Overhead: the exchange of messages between sites adds overhead and
communication latency.
• Communication problems
• Increased Complexity and Data Inconsistency Problems: clients can concurrently read and
modify closely related data stored in different database instances.
• Security Problems: both the network and the replicated data have to be secured.
