Adb CH 4
Centralized Database
Distributed Database System
• What makes DDBMS different is that:
▪ The various sites are aware of each other.
▪ Each site provides a facility for executing both local and global transactions.
• The different sites can be connected physically in different topologies:
▪ Fully networked
▪ Partially Connected
▪ Tree Network
▪ Star Network and
▪ Ring Network
• Distributed Query Processing: optimization of queries that access remote data.
• Extended Security: access control for distributed data.
• Extended Concurrency Control: maintains the consistency of replicated data.
• Extended Recovery Services: handle failures of individual sites and of the communication lines.
o Begins with the requirement analysis that defines the environment of the system and elicits
both the data and processing needs of all potential database users
o It is applicable for the design of homogeneous databases
• Bottom-up: When the databases already exist at a number of sites
o Design involves integrating databases into one database
o Integrate Local schema into Global schema
o It is ideal in the context of heterogeneous databases
4.2.2. Data Distribution/Allocation Strategies
A distributed DB stores logically related shared data and metadata at several physically independent
sites connected via a network. Data allocation is the process of deciding where to store
a particular data item. There are three data allocation strategies:
1. Centralized: the entire DB is located at a single site, and other computers access it through the network.
2. Partitioned: the DB is split into several disjoint parts (called partitions, segments or fragments)
that are stored at several sites.
3. Replicated: copies of one or more partitions are stored at several sites.
• Selective: combines fragmentation (for locality of reference), replication (for data that is seldom updated)
and centralization, as appropriate for the data.
• Complete: a copy of the whole database is made available at each site. Snapshots are one method used here.
In a distributed database system, the database is logically stored as single database but physically
fragmented on several computers. A distributed database system has the following components:
▪ Local DBMS
▪ Distributed DBMS
▪ Global System Catalog (GSC)
▪ Data communication (DC)
A distributed database system consists of a collection of sites, each of which maintains a local
database system (Local DBMS) but each local DBMS also participates in at least one global
transaction where different databases are integrated together.
▪ Local Transaction: transactions that access data only in that single site
▪ Global Transaction: transactions that access data in several sites.
4.2.3. Data Stored in DDBMS
How is data stored in DDBMS? There are several ways of storing a single relation in distributed
database systems.
4.2.3.1. Replication
• The system maintains multiple identical copies of the data.
• The copies are stored at different sites, for faster retrieval and fault tolerance.
• Duplicate copies of the tables can be kept on each system (replicated). With this option, updates
to the tables become more involved because every copy must be changed (alternatively, the copies can be made read-only).
• Advantages: availability and increased parallelism (for read-only access)
• Disadvantage: increased overhead on update
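The trade-off between cheap reads and expensive updates can be illustrated with a small sketch. It assumes a simple read-one/write-all policy, which these notes do not prescribe; the class name, the site names, and the counter are hypothetical and exist only for illustration.

```python
class ReplicatedItem:
    """One data item copied to several sites (hypothetical site names)."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}
        self.site_writes = 0  # counts individual copy updates

    def read(self, site):
        # Any single copy can answer a read, so reads can run in parallel.
        return self.copies[site]

    def write(self, value):
        # Assumed policy: every copy is updated so the replicas stay identical.
        for site in self.copies:
            self.copies[site] = value
            self.site_writes += 1


item = ReplicatedItem(["S1", "S2", "S3"], value=100)
item.write(250)
values = {item.read(site) for site in ("S1", "S2", "S3")}
print(values, item.site_writes)
```

One logical update costs three copy updates here, one per site, while a read touches only a single site.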
4.2.3.2. Fragmentation
A relation is partitioned into several fragments stored at distinct sites. The partitioning can be
vertical, horizontal or both. A fragmentation of a relation R into fragments R1, R2, …, Rn is correct if it satisfies the following:
a. Completeness: every data item of R must appear in at least one fragment.
b. Reconstruction: it must be possible to reconstruct R from its fragments.
c. Disjointness: a data item should appear in only one fragment, except in vertical
fragmentation, where the primary key is repeated for reconstruction.
1. Horizontal Fragmentation
• Systems can share the responsibility of storing information from a single table with
individual systems storing groups of rows
• Performed by the Selection Operation
• The whole content of the relation is reconstructed using the UNION operation
2. Vertical Fragmentation
• Systems can share the responsibility of storing particular attributes of a table.
• Every fragment needs an attribute that identifies the tuple (the primary key value is repeated in each fragment).
• Performed by the Projection Operation
• The whole content of the relation is reconstructed with the Natural JOIN operation on
the tuple-identifying attribute (the primary key values)
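The two fragmentation schemes and their reconstruction rules can be sketched in a few lines. The EMP relation, its attribute names, and the split predicate below are hypothetical, and rows are plain Python dictionaries rather than tables in a real DBMS.

```python
# Hypothetical relation EMP(eno, name, dept); eno is the primary key.
EMP = [
    {"eno": 1, "name": "Abebe", "dept": "Sales"},
    {"eno": 2, "name": "Sara", "dept": "HR"},
    {"eno": 3, "name": "Kebede", "dept": "Sales"},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, attrs):
    """Projection: keep only the named attributes of every row."""
    return [{a: r[a] for a in attrs} for r in rows]


def union(*fragments):
    """Union of horizontal fragments, ordered by primary key."""
    return sorted((r for f in fragments for r in f), key=lambda r: r["eno"])


def natural_join(left, right, key):
    """Natural join of two vertical fragments on the shared key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


# Horizontal fragmentation: disjoint groups of rows, produced by selection.
h1 = select(EMP, lambda r: r["dept"] == "Sales")
h2 = select(EMP, lambda r: r["dept"] != "Sales")

# Vertical fragmentation: groups of attributes, primary key repeated.
v1 = project(EMP, ["eno", "name"])
v2 = project(EMP, ["eno", "dept"])

horizontal_ok = union(h1, h2) == EMP
vertical_ok = natural_join(v1, v2, "eno") == EMP
print(horizontal_ok, vertical_ok)
```

Both checks succeed because the horizontal fragments are complete and disjoint, and because each vertical fragment carries the primary key needed for the join.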
• Distributed Transaction Management
• Replication Data Management: If copies of data are kept on many machines,
how soon is a copy updated after the data changes on another system? Who is in charge of
propagating the update?
• Distributed Database Recovery: If one machine goes down, how does that affect the others?
• Security: Just like any computer network, a distributed system needs a common way
to validate users entering from any computer in the network of servers.
• Common Data Dictionary: Each schema now has to be distinguishable from, and work in connection
with, schemas created on many systems.
4.4.1. Homogeneous and Heterogeneous Distributed Databases
1. Homogeneous Distributed Database
In a homogeneous distributed database, all sites store the database identically. The operating system,
database management system (DBMS) and data structures are the same at all sites, so such systems are
easy to manage. The sites are aware of each
other and agree to cooperate in processing user requests. Each site surrenders part of its autonomy
in terms of the right to change schemas or software. The database appears to the user as a single system.
2. Heterogeneous Distributed Database
In a heterogeneous distributed database, different sites may use different schemas and different
DBMS software. The difference in schemas is a major problem for query processing, and the
difference in software is a major problem for transaction processing. Different computers may also
run different operating systems and database applications, and may even use different data models
for the database, so translations are required for the sites to communicate. A particular site might be
completely unaware of the other sites and may provide only limited facilities for cooperation in
transaction processing. Gateways may be needed for the sites to interface with one another.
4.4.2. Date’s Twelve Rules for a DDBMS
This section lists Date’s twelve rules (objectives) for DDBMSs. The basis for these rules is that a
distributed DBMS should feel like a non-distributed DBMS to the user. These rules are akin to
Codd’s twelve rules for relational systems. The fundamental principles are:
1. Local autonomy: The sites in a distributed system should be autonomous. In this context,
autonomy means that local data is locally owned and managed, local operations remain purely
local, and all operations at a given site are controlled by that site.
2. No reliance on a central site: there should be no central servers for services such as transaction
management, deadlock detection, query optimization, and management of the global system
catalog.
3. Continuous operation: Ideally, there should never be a need for a planned system shutdown for
operations such as adding or removing a site from the system, or dynamically creating and
deleting fragments at one or more sites.
4. Location independence: Location independence is equivalent to location transparency. The user
should be able to access the database from any site. Furthermore, the user should be able to
access all data as if it were stored at the user’s site, no matter where it is physically stored.
5. Fragmentation independence: The user should be able to access the data, no matter how it is
fragmented.
6. Replication independence: The user should be unaware that data has been replicated. Thus, the
user should not be able to access a particular copy of a data item directly, nor should the user
have to specifically update all copies of a data item.
7. Distributed query processing: The system should be capable of processing queries that reference
data at more than one site.
8. Distributed transaction processing: The system should support the transaction as the unit of
recovery. The system should ensure that both global and local transactions conform to the ACID
rules for transactions, namely: atomicity, consistency, isolation, and durability.
9. Hardware independence: It should be possible to run the DDBMS on a variety of hardware
platforms.
10. Operating system independence: It should be possible to run the DDBMS on a variety of
operating systems.
11. Network independence: It should be possible to run the DDBMS on a variety of disparate
communication networks.
12. Database independence: It should be possible to have a DDBMS made up of different local
DBMSs, perhaps supporting different underlying data models. In other words, the system
should support heterogeneity.
4.5. Distributed Query Processing
The objective of distributed query processing is to transform a high-level query on a distributed
database into an efficient execution strategy expressed in a low-level language on the local databases.
• Minimize a cost function: I/O cost + CPU cost + communication cost
• These components might have different weights in different distributed environments:
o Wide area networks: communication cost will dominate because of low bandwidth, low speed, and high protocol
overhead.
o Local area networks: communication cost is not as dominant, so the total cost function should
be considered.
Example: SELECT ENAME FROM EMP, ASG WHERE EMP.ENO=ASG.ENO AND DUR > 37;
Cost of Alternatives:
Assume
• Size(EMP) = 400, size(ASG) = 1000
• Tuple access cost = 1 unit; tuple transfer cost = 10 units
Strategy 1
• Produce ASGi: (10+10) ∗ tuple access cost = 20
• Transfer ASGi to the sites of EMP: (10+10) ∗ tuple transfer cost = 200
• Produce EMPi: (10+10) ∗ tuple access cost ∗ 2 = 40
• Transfer EMPi to result site: (10+10) ∗ tuple transfer cost = 200
• Total cost = 460
Strategy 2
• Transfer EMP to site 5: 400 ∗ tuple transfer cost = 4,000
• Transfer ASGi to site 5: 1000 ∗ tuple transfer cost = 10,000
• Produce ASGi: 1000 ∗ tuple access cost = 1,000
• Join EMPi and ASGi: 400 ∗ 20 ∗ tuple access cost = 8,000
• Total cost = 23,000
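The two totals can be re-derived with a short calculation. This is only an arithmetic check: the variable names are illustrative, and the figure of 10 qualifying ASG tuples at each of two sites is an assumption read off the (10+10) terms in the cost lines.

```python
TUPLE_ACCESS = 1      # cost units per tuple accessed
TUPLE_TRANSFER = 10   # cost units per tuple transferred
SIZE_EMP = 400
SIZE_ASG = 1000
QUALIFYING = 10 + 10  # assumed: 10 matching ASG tuples at each of two sites

strategy_1 = {
    "produce ASGi": QUALIFYING * TUPLE_ACCESS,
    "transfer ASGi to the sites of EMP": QUALIFYING * TUPLE_TRANSFER,
    "produce EMPi": QUALIFYING * TUPLE_ACCESS * 2,
    "transfer EMPi to result site": QUALIFYING * TUPLE_TRANSFER,
}
strategy_2 = {
    "transfer EMP to site 5": SIZE_EMP * TUPLE_TRANSFER,
    "transfer ASG to site 5": SIZE_ASG * TUPLE_TRANSFER,
    "produce ASGi": SIZE_ASG * TUPLE_ACCESS,
    "join EMP and ASGi": SIZE_EMP * QUALIFYING * TUPLE_ACCESS,
}

total_1 = sum(strategy_1.values())
total_2 = sum(strategy_2.values())
print(total_1, total_2, total_2 / total_1)
```

Under these figures the second strategy costs fifty times as much as the first, almost entirely because whole relations are shipped across the network before any selection is applied.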
4.6. Distributed Transaction Management and Recovery
There are different strategies for processing a specific query; choosing a good one improves the performance of
the system by minimizing processing time and cost. In addition to the cost estimates used for a
centralized database (disk access, relation size, etc.), we have to consider the following in distributed
query processing:
• Cost of data transmission over the network
• Gain from parallel processing of a single query
With replicated data allocation, parallel processing can be used to increase
performance, but updates have a large impact because every site holding a copy of the data item must be
updated. With fragmentation, updates work more like those in a centralized database, but
reconstructing the whole relation requires accessing data from all sites containing part of the
relation. Suppose the distributed database has three sites (S1, S2, and S3), and two relations,
EMPLOYEE and DEPARTMENT, are located at S1 and S2 respectively, without any fragmentation.
A query is initiated from S3 to retrieve each employee together with the department they are
working in: First Name (15 bytes long), Last Name (15 bytes long) and Department Name (10 bytes
long), a total of 40 bytes per result record.
For EMPLOYEE we have the following information:
1. 10,000 records
2. each record is 100 bytes long
For DEPARTMENT we have the following information:
1. 100 records
2. each record is 35 bytes long
There are three ways of executing this query:
1. Transfer DEPARTMENT and EMPLOYEE to S3 and perform the join there: this needs a transfer
of 10,000*100 + 100*35 = 1,003,500 bytes.
2. Transfer EMPLOYEE to S2 and perform the join there; the result has 40*10,000 = 400,000
bytes and is transferred to S3. In total, 1,000,000 + 400,000 = 1,400,000 bytes are
transferred.
3. Transfer DEPARTMENT to S1 and perform the join there; the result has 40*10,000 =
400,000 bytes and is transferred to S3. In total, 3,500 + 400,000 = 403,500 bytes are
transferred.
One can then select the strategy that minimizes the data transfer cost for this specific query (here, the third). Other
optimization steps, such as reducing the size of the relations with projection before they are
transferred, can make the processing more efficient.
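The byte counts for the three strategies can be checked with a short calculation. The constant names are illustrative, and the sketch assumes, as the 40*10,000 figure does, that the join produces exactly one 40-byte result record per employee.

```python
EMP_RECORDS, EMP_RECORD_BYTES = 10_000, 100
DEPT_RECORDS, DEPT_RECORD_BYTES = 100, 35
RESULT_RECORD_BYTES = 15 + 15 + 10  # first name + last name + department name

emp_bytes = EMP_RECORDS * EMP_RECORD_BYTES
dept_bytes = DEPT_RECORDS * DEPT_RECORD_BYTES
# Assumed: one result record per employee.
result_bytes = EMP_RECORDS * RESULT_RECORD_BYTES

transfer = {
    "1: ship both relations to S3": emp_bytes + dept_bytes,
    "2: ship EMPLOYEE to S2, result to S3": emp_bytes + result_bytes,
    "3: ship DEPARTMENT to S1, result to S3": dept_bytes + result_bytes,
}

best = min(transfer, key=transfer.get)
for name, size in transfer.items():
    print(f"{name}: {size:,} bytes")
print("least data transferred by strategy", best)
```

Shipping the small DEPARTMENT relation to the site of the large EMPLOYEE relation moves the least data, which is why the third strategy wins for this query.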
4.5.1. Transaction Management
A transaction is a logical unit of work consisting of one or more operations executed by a single
user. A transaction begins with the user's first executable query statement and ends when it is
committed or rolled back. A distributed transaction is a transaction that includes one or more
statements that, individually or as a group, update data on two or more distinct nodes of a distributed
database.
Representation of Query in Distributed Database:
SQL Statement | Object | Database | Domain
SELECT * FROM dept@sales.midroc.telecom.et; | Department | Inventory | wku.edu.et
There are two types of transaction in DDBMS to access data from other sites:
1. Remote Transaction: contains only statements that access a single remote node. A remote
query statement is a query that selects information from one or more remote tables, all of which
reside at the same remote node or site. For example, the following query accesses data from the
department table in the Addis schema (the site) of the remote inventory database:
• SELECT * FROM Addis.department@inventory.wku.edu.et;
A remote update statement is an update that modifies data in one or more tables, all of which are
located at the same remote node. For example, the following statement updates the dept table in
the Addis schema of the remote sales database:
UPDATE Addis.dept@sales.midroc.telecom.et SET loc = 'Arada' WHERE BranchNo = 5;
2. Distributed Transaction: contains statements that access more than one node.
A distributed query statement retrieves information from two or more nodes. If all statements of a
transaction reference only a single remote node, the transaction is remote, not distributed. A
database must guarantee that all statements in a transaction, distributed or non-distributed, either
commit or roll back as a unit. The effects of an ongoing transaction should be invisible to all other
transactions at all nodes; this transparency should hold for transactions that include any type of
operation, including queries, updates, or remote procedure calls. For example, the following query
accesses data from the local database as well as the remote sales database:
• SELECT ename, dname FROM Dessie.emp DS, Addis.dept@sales.midroc.telecom.et AD
WHERE DS.deptno = AD.deptno;
(Employee data is stored in Dessie and sales data is stored in Addis; there is an employee responsible for each sale.)
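The distinction between remote and distributed transactions can be sketched as a small classifier. It assumes that each statement is already reduced to the list of table references it uses and that a reference without an `@node` suffix is local; the function names and the `LOCAL` placeholder are hypothetical, and no SQL parsing is attempted.

```python
LOCAL = "local"  # placeholder for the node where the transaction starts


def node_of(table_ref):
    """Return the node named after '@', or LOCAL when there is none."""
    _, at, node = table_ref.partition("@")
    return node.strip() if at else LOCAL


def classify(statements):
    """Classify a transaction given one list of table references per statement."""
    nodes = {node_of(ref) for refs in statements for ref in refs}
    if nodes == {LOCAL}:
        return "local"
    if len(nodes) == 1:
        return "remote"       # every statement uses the same single remote node
    return "distributed"      # more than one node is involved


remote_txn = [["Addis.dept@sales.midroc.telecom.et"]]
distributed_txn = [["Dessie.emp", "Addis.dept@sales.midroc.telecom.et"]]
print(classify(remote_txn), classify(distributed_txn))
```

The join of the local Dessie.emp table with the remote dept table is classified as distributed because it touches two nodes, while the single-table statement is remote.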
Different kinds of SQL statements
• Remote query: Select client_nm from clients@accounts.motorola.com;
• Improved performance
• Easier and more economical system expansion
• Many existing systems: There may be many different existing systems, possibly of
different kinds (Oracle, Informix, others), that need to be used together.
• Data sharing and distributed control:
o User at one site may be able to access data that is available at another site.
o Each site can retain some degree of control over local data
o We will have local as well as global database administrator
• Reliability and availability of data
o Improved reliability/availability through distributed transactions
o If one site fails, the rest can continue operating as long as a transaction does not
need data from the failed site, or the data it needs is replicated at other sites
• Speedup of query processing
o If a query involves data from several sites, it may be possible to split the query
into sub-queries that are executed at several sites in parallel
o A query can be sent to the least heavily loaded sites
• Expansion (Scalability): In a distributed environment you can easily expand
by adding more machines to the network.
4.7.2. Disadvantages of DDBMS
• Software Development Cost: the system is difficult to install and is therefore costly.
• Greater Potential for Bugs: parallel processing may endanger the correctness of algorithms.
• Increased Processing Overhead: sites must exchange messages, which adds
communication latency and protocol overhead.
• Communication problems
• Increased Complexity and Data Inconsistency Problems: clients can read and
modify closely related data stored in different database instances concurrently.
• Security Problems: network security and the security of replicated data.