UNIT 1 _SCSA3008_DISTRIBUTED DATABASE AND INFORMATION
SCSA3008_DISTRIBUTED DATABASE AND INFORMATION
SYSTEMS
COURSE OBJECTIVES
COURSE OUTCOMES
On completion of the course, students will be able to
CO1 - Identify the introductory distributed database concepts and its structures.
CO2 - Produce the transaction management and query processing techniques in
DDBMS.
CO3 - Develop in-depth understanding of relational databases and skills to optimize
database performance in practice.
CO4 - Critique each type of database.
CO5 - Analyse, design and present information systems.
CO6 - Design decision support systems and tools for business operations.
UNIT 1 9 Hrs.
INTRODUCTORY CONCEPTS AND DESIGN OF DDBMS
Data Fragmentation - Replication and allocation techniques for DDBMS - Methods
for designing and implementing DDBMS - designing a distributed relational
database - Architectures for DDBMS - Cluster federated - parallel databases and
client server architecture - Overview of query processing.
UNIT 2 9 Hrs.
DISTRIBUTED SECURITY AND DISTRIBUTED DATABASE
APPLICATION TECHNOLOGIES
Overview of security techniques - Cryptographic algorithms - Digital signatures -
Distributed Concurrency Control - Serializability theory - Taxonomy of concurrency
control mechanisms - Distributed deadlocks – Distributed Database Recovery -
Distributed Data Security - Web data management - Database Interoperability.
UNIT 3 9 Hrs.
ADVANCES IN DISTRIBUTED SYSTEMS
Authentication in distributed systems - Protocols based on symmetric cryptosystems
- Protocols based on asymmetric cryptosystems - Password-based authentication -
Unstructured overlays - Chord distributed hash table – Content addressable
networks (CAN) - Tapestry - Some other challenges in P2P system design -
Tradeoffs between table storage and route lengths - Graph structures of complex
networks - Internet graphs - Generalized random graph networks.
UNIT 4 9 Hrs.
FUNDAMENTALS OF INFORMATION SYSTEMS
Defining information – Classification of information – Presentation of information
systems – Basics of Information systems – Functions of information systems –
Components of Information systems- Limitations of Information systems –
Information System Design.
UNIT 5 9 Hrs.
ENTERPRISE COLLABORATION SYSTEMS
Groupware – Types of groupware – Enterprise Communication tools – Enterprise
Conferencing tools – Collaborative work management tools – Information System
for Business operations – transaction processing systems – functional Information
Systems – Decision Support systems – Executive Information systems – Online
Analytical processing.
UNIT 1
INTRODUCTORY CONCEPTS AND DESIGN OF DDBMS
Data Fragmentation - Replication and allocation techniques for DDBMS - Methods
for designing and implementing DDBMS - designing a distributed relational
database - Architectures for DDBMS - Cluster federated - parallel databases and
client server architecture - Overview of query processing.
INTRODUCTION
➢ Distributed Databases
o Reality (e.g., WWW, Grids, Cloud, Sensors, Mobiles, …)
Fig 1.2 Distributed DBMS environment
➢ Processing overhead
Even simple operations may require a large number of messages and additional calculations to maintain uniformity of data across the sites.
➢ Data integrity
The need to update data at multiple sites poses problems of data integrity.
Data Allocation
Data allocation is the intelligent distribution of pieces of data (called data fragments) across sites to improve database performance and data availability for end-users. It aims to reduce the overall cost of transaction processing while also delivering accurate data rapidly in a DDBMS. Data allocation is one of the key steps in building a distributed database system. There are two common strategies used in optimal data allocation: data fragmentation and data replication.
Data Fragmentation
➢ Fragmentation is the process of partitioning relations (tables) into several fragments stored at multiple sites.
➢ It divides a database into various subtables and sub relations so that data can be
distributed and stored efficiently.
➢ Database Fragmentation can be of two types: horizontal or vertical.
➢ In a horizontal fragmentation, each tuple of a relation r is assigned to one or more
fragments.
➢ In vertical fragmentation, the schema of a relation r is split into several smaller schemas that share a common candidate key (and possibly a special tuple-id attribute), so that the original relation can be reconstructed.
Data Replication
➢ Distributed Database Replication is the process of creating and maintaining multiple
copies (redundancy) of data in different sites.
➢ Its main benefit is that duplicating data ensures faster retrieval.
➢ This eliminates single points of failure and data loss issues if one site fails to deliver
user requests, and hence provides you and your teams with a fault-tolerant system.
➢ However, Distributed Database Replication also has some disadvantages.
➢ To ensure accurate and correct responses to user queries, data must be constantly
updated and synchronized at all times.
➢ Failure to do so will create inconsistencies in data, which can hamper business goals and
decisions for other teams.
Advantages of Fragmentation
➢ Since data is stored close to the site of usage, efficiency of the database system is
increased.
➢ Local query optimization techniques are sufficient for most queries since data is locally
available.
➢ Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
➢ When data from different fragments are required, the access speeds may be very low.
➢ In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
➢ Lack of back-up copies of data in different sites may render the database ineffective in
case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order
to maintain reconstructiveness, each fragment should contain the primary key field(s) of the
table. Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students
in a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −
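The fragment definitions themselves are not shown above; a minimal sketch of the idea, assuming a hypothetical STUDENT schema (Regd_No, Name, Course, Fees) that stands in for the university table, can be written with Python's built-in sqlite3 module:

```python
import sqlite3

# In-memory database standing in for the university's STUDENT table.
# The schema (Regd_No, Name, Course, Fees) is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT, Fees REAL)")
conn.execute("INSERT INTO STUDENT VALUES (1, 'Asha', 'Computer Science', 45000)")
conn.execute("INSERT INTO STUDENT VALUES (2, 'Ravi', 'Physics', 30000)")

# Vertical fragmentation: each fragment keeps the primary key (Regd_No)
# so the original relation can be reconstructed by a join.
conn.execute("CREATE TABLE STD_INFO AS SELECT Regd_No, Name, Course FROM STUDENT")
conn.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")  # kept at accounts

# Reconstruction: join the fragments on the shared primary key.
rows = conn.execute(
    "SELECT i.Regd_No, i.Name, i.Course, f.Fees "
    "FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No "
    "ORDER BY i.Regd_No"
).fetchall()
print(rows)
```

Because each fragment carries the primary key Regd_No, a join on it recovers the full STUDENT rows without loss, while the fees fragment alone can be kept at the accounts section.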
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table according to the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness: each horizontal fragment must have all the columns of the original base table.
For example, in the student schema, if the details of all students of the Computer Science course need to be maintained at the School of Computer Science, the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.
• At first, generate a set of horizontal fragments; then generate vertical fragments from
one or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from
one or more of the vertical fragments.
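The first of the two orderings above (horizontal first, then vertical) can be sketched as follows, again using a hypothetical STUDENT schema:

```python
import sqlite3

# Hybrid fragmentation sketch: horizontal first, then vertical on one of
# the horizontal fragments. The STUDENT schema is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT, Fees REAL)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", [
    (1, 'Asha', 'Computer Science', 45000),
    (2, 'Ravi', 'Physics', 30000),
    (3, 'Mina', 'Computer Science', 45000),
])

# Step 1 (horizontal): split tuples by Course.
conn.execute("CREATE TABLE CS_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'")

# Step 2 (vertical): within the CS fragment, separate the fee details,
# keeping the primary key in both pieces for reconstruction.
conn.execute("CREATE TABLE CS_STD_INFO AS SELECT Regd_No, Name, Course FROM CS_STD")
conn.execute("CREATE TABLE CS_STD_FEES AS SELECT Regd_No, Fees FROM CS_STD")

count = conn.execute("SELECT COUNT(*) FROM CS_STD_INFO").fetchone()[0]
print(count)  # two Computer Science students end up in the hybrid fragments
```

Rebuilding the original STUDENT table from such fragments requires joining the vertical pieces and then taking the union of the horizontal pieces, which is why reconstruction under hybrid fragmentation is expensive.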
Data Replication
Data replication is the process of storing separate copies of the database at two or more sites. It
is a popular fault tolerance technique of distributed databases.
Reduction in Network Load − Since local copies of data are available, query processing can
be done with reduced network usage, particularly during prime hours. Data updating can be
done at non-prime hours.
Quicker Response − Availability of local copies of data ensures quick query processing and
consequently quick response time.
Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Increased Cost and Complexity of Data Updating − Each time a data item is updated, the
update needs to be reflected in all the copies of the data at the different sites. This requires
complex synchronization techniques and protocols.
Undesirable Application – Database coupling − If complex update mechanisms are not used,
removing data inconsistency requires complex co-ordination at application level. This results in
undesirable application – database coupling.
• Snapshot replication
• Near-real-time replication
• Pull replication
• The database is accessed through a single interface as if it is a single database.
Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS co-ordinates data updates across the sites.
Un-federated − The database systems employ a central coordinating module through which the databases are accessed.
Distribution − It states the physical distribution of data across the different sites.
Autonomy − It indicates the distribution of control of the database system and the degree to
which each constituent DBMS can operate independently.
Architectural Models
Case study
• A database server is the Oracle software managing a database, and a client is an
application that requests information from a server
• Each computer in a network is a node that can host one or more databases
• Each node in a distributed database system can act as a client, a server, or both,
depending on the situation
• In the figure, the host for the hq database acts as a database server when a statement is issued against its local data (for example, the second statement in each transaction issues a statement against the local dept table), but acts as a client when it issues a statement against remote data (for example, the first statement in each transaction is issued against the remote table emp in the sales database)
Fig 1.5 Case Study
• In contrast, an indirect connection occurs when a client connects to a server and then
accesses information contained in a database on a different server
• For example, if you connect to the hq database but access the emp table on the remote
sales database as in the figure, you can issue the following
• SELECT * FROM emp@sales
• This query is indirect because the object you are accessing is not on the database to
which you are directly connected
In these systems, each peer acts both as a client and a server for providing database services. The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas –
Schemas Present
➢ The individual internal schema definition at each site is the local internal schema.
➢ The enterprise view of data is described by the global conceptual schema.
➢ The local organization of data at each site is described in the local conceptual schema.
➢ User applications and user access to the database are supported by external schemas.
➢ Local conceptual schemas are mappings of the global schema onto each site.
➢ Databases are typically designed in a top-down fashion, and therefore all external view definitions are made globally.
Major Components of a Peer-to-Peer System
– User Processor
– Data processor
User Processor
• User-interface handler
• responsible for interpreting user commands, and formatting the result data
• Semantic data controller
• checks if the user query can be processed.
• Global Query optimizer and decomposer
• determines an execution strategy
• Translates global queries into local ones.
• Distributed execution
• Coordinates the distributed execution of the user request
Data processor
• Local query optimizer
• Acts as the access path selector
• Responsible for choosing the best access path
• Local Recovery Manager
• Makes sure local database remains consistent
• Run-time support processor
• Is the interface to the operating system and contains the database buffer
• Responsible for maintaining the main memory buffers and managing the data access.
Multi - DBMS Architectures
This is an integrated database system formed by a collection of two or more autonomous
database systems.
Multi-DBMS can be expressed through six levels of schemas −
• Multi-database View Level − Depicts multiple user views, each comprising a subset of the integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database, comprising global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.
Parallel Databases
• In a parallel database system, data processing performance is improved by using multiple resources in parallel.
• CPUs and disks are used in parallel to enhance processing performance.
• Operations like data loading and query processing are performed in parallel. Centralized and client-server database systems are not powerful enough to handle applications that need fast processing.
• Parallel database systems have great advantages for online transaction processing and
decision support applications.
• Parallel processing divides a large task into multiple tasks, and each task is performed concurrently on several nodes. This allows a large task to complete more quickly.
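This divide-and-merge idea can be sketched in a few lines; here worker threads stand in for the nodes, whereas a real parallel DBMS would run each scan on a separate machine:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel query processing: a large scan is divided into
# partitions, each "node" (a worker thread here) scans only its own
# partition, and the coordinator merges the partial results.

def scan_partition(partition, course):
    """Each node filters only its local fragment of the table."""
    return [row for row in partition if row[1] == course]

# A toy STUDENT relation, horizontally partitioned across three nodes.
partitions = [
    [(1, 'CS'), (2, 'Physics')],
    [(3, 'CS'), (4, 'Maths')],
    [(5, 'CS')],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(scan_partition, partitions, ['CS'] * len(partitions)))

# Merge step: the coordinator concatenates the partial results.
result = [row for part in partials for row in part]
print(result)  # [(1, 'CS'), (3, 'CS'), (5, 'CS')]
```

Because each partition is scanned independently, adding more nodes (and splitting the data further) shortens the scan time, which is exactly the speed-up parallel database systems aim for.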
Architectural Models
There are several architectural models for parallel machines. The most important ones are as follows −
➢ Shared-memory multiple CPU − Here, the computer has several simultaneously active
CPUs that are attached to an interconnection network and share a single main memory
and a common array of disk storage.
➢ Shared disk architecture − Here, each node has its own main memory but all nodes
share mass storage. In practice, each node also has multiple processors.
➢ Shared nothing architecture − Here, each node has its own mass storage as well as
main memory.
1. Shared Memory Architecture - In shared memory architecture, multiple CPUs are attached to an interconnection network and share a single (global) main memory and common disk arrays. Note that, in this architecture, a single copy of a multi-threaded operating system and a multi-threaded DBMS can support these multiple CPUs. Shared memory is a tightly coupled architecture in which multiple CPUs share their memory; it is also known as symmetric multiprocessing (SMP). This architecture covers a very wide range, from personal workstations that support a few microprocessors in parallel up to large RISC-based multiprocessor systems.
2. Shared Disk Architecture - In shared disk architecture, each node has its own main memory, but all the nodes share mass storage through the interconnection network.
Advantages :
➢ The interconnection network is no longer a bottleneck, since each CPU has its own memory.
➢ Load-balancing is easier in shared disk architecture.
➢ There is better fault tolerance.
Disadvantages :
➢ If the number of CPUs increases, the problems of interference and memory contentions
also increase.
➢ There also exists a scalability problem.
3. Shared Nothing Architecture :
Shared nothing architecture is a multiple-processor architecture in which each processor has its own memory and disk storage. Multiple CPUs are attached to an interconnection network through a node, and no two CPUs can access the same disk area. In this architecture, no sharing of memory or disk resources is done. It is also known as massively parallel processing (MPP).
Advantages :
➢ There is no memory or disk contention, so interference among processors is minimized.
➢ The architecture scales well to a large number of nodes, since each node works only on its own data with its own resources.
Fig 1.12 Cluster Database Architecture
A database instance runs on every node of the cluster. Transactions running on any instance can read or update any part of the database; there is no notion of data ownership by a node.
System performance is based on the database effectively utilizing a fast interconnect, such as
the Virtual Interface Architecture (VIA), between cluster nodes. Oracle9i Real Application
Clusters (RAC) is the first successful shared-disk cluster architecture and utilizes sophisticated
Cache Fusion ™ shared-cache algorithms to allow high performance and scalability without
data or application partitioning.
• In distributed query processing, the data transfer cost is the cost of transferring intermediate files to other sites for processing, plus the cost of transferring the final result file to the site where the result is required.
• Let’s say that a user sends a query to site S1, which requires data from its own and also
from another site S2. Now, there are three strategies to process this query which are
given below:
➢ We can transfer the data from S2 to S1 and then process the query
➢ We can transfer the data from S1 to S2 and then process the query
➢ We can transfer the data from S1 and S2 to S3 and then process the query.
• The choice depends on various factors, such as the sizes of the relations and of the result, the communication cost between the sites, and the site at which the result will be used.
• Commonly, the data transfer cost is calculated in terms of the size of the messages and can be computed with the formula:
Data transfer cost = C * Size
where C is the cost per byte of transferring data and Size is the number of bytes transmitted.
Site 1: EMPLOYEE − 1,000 records of 60 bytes each
Site 2: DEPARTMENT − DID (10 bytes), DName (20 bytes); 50 records of 30 bytes each
Example : Find the name of employees and their department names. Also, find the amount of
data transfer to execute this query when the query is submitted to Site 3.
Answer : The query is submitted at Site 3, and neither of the two relations, EMPLOYEE and DEPARTMENT, is available at Site 3. So, to execute this query, we have three strategies:
➢ Transfer both tables, EMPLOYEE and DEPARTMENT, to Site 3 and join them there. The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.
➢ Transfer the EMPLOYEE table to Site 2, join it there, and then transfer the result to Site 3. The total cost is 60 * 1000 + 60 * 1000 = 120,000 bytes, since we also have to transfer 1,000 result tuples (NAME and DNAME, 60 bytes each) from Site 2 to Site 3.
➢ Transfer the DEPARTMENT table to Site 1, join it there, and then transfer the result to Site 3. The total cost is 30 * 50 + 60 * 1000 = 61,500 bytes, since we have to transfer 1,000 result tuples (NAME and DNAME, 60 bytes each) from Site 1 to Site 3.
Now, if the optimisation criterion is to reduce the amount of data transfer, we can choose either strategy 1 or strategy 3 from the above.
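The three cost calculations above can be checked with a few lines of arithmetic; the record counts and sizes are the ones given in the example:

```python
# Data-transfer costs for the three strategies, using the sizes from the
# example: EMPLOYEE has 1,000 records of 60 bytes, DEPARTMENT has 50
# records of 30 bytes, and each joined result tuple is 60 bytes.

EMP_RECORDS, EMP_SIZE = 1000, 60
DEPT_RECORDS, DEPT_SIZE = 50, 30
RESULT_RECORDS, RESULT_SIZE = 1000, 60  # one result tuple per employee

# Strategy 1: ship both relations to Site 3 and join there.
s1 = EMP_RECORDS * EMP_SIZE + DEPT_RECORDS * DEPT_SIZE

# Strategy 2: ship EMPLOYEE to Site 2, join, then ship the result to Site 3.
s2 = EMP_RECORDS * EMP_SIZE + RESULT_RECORDS * RESULT_SIZE

# Strategy 3: ship DEPARTMENT to Site 1, join, then ship the result to Site 3.
s3 = DEPT_RECORDS * DEPT_SIZE + RESULT_RECORDS * RESULT_SIZE

print(s1, s2, s3)  # 61500 120000 61500
```

Strategies 1 and 3 tie at 61,500 bytes, which is why either of them minimizes the data transfer.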
The semi-join operation is used in distributed query processing to reduce the number of tuples in a table before transmitting it to another site. This reduces the number and the total size of the transmissions, ultimately reducing the total cost of data transfer. Say we have two tables R1 and R2 at sites S1 and S2. We forward the joining column of one table, say R1, to the site where the other table, R2, is located, and join it with R2 there. Whether to reduce R1 or R2 can only be decided after comparing the advantages of reducing R1 with those of reducing R2. Thus, the semi-join is an efficient way to reduce the transfer of data in distributed query processing.
Example : Find the amount of data transferred to execute the same query given in the above example using the semi-join operation.
Project the joining attributes of the EMPLOYEE table at Site 1 and transfer them to Site 3. We transfer NAME and DID (EMPLOYEE), and the size is 25 * 1000 = 25,000 bytes.
Transfer the DEPARTMENT table to Site 3 and join the projected attributes of EMPLOYEE with this table. With 50 records of 30 bytes each, this transfer is 30 * 50 = 1,500 bytes.
Applying the above scheme, the amount of data transferred to execute the query is 25,000 + 1,500 = 26,500 bytes.
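The semi-join figures can be verified the same way, using the 25-byte projected EMPLOYEE tuples from the example and the 30-byte DEPARTMENT records stated in the schema:

```python
# Semi-join cost for the same query: instead of shipping the full
# 60-byte EMPLOYEE records, only the projected joining attributes
# (NAME, DID) of 25 bytes each are shipped from Site 1 to Site 3.

PROJ_RECORDS, PROJ_SIZE = 1000, 25   # projected EMPLOYEE tuples
DEPT_RECORDS, DEPT_SIZE = 50, 30     # full DEPARTMENT shipped to Site 3

semi_join_cost = PROJ_RECORDS * PROJ_SIZE + DEPT_RECORDS * DEPT_SIZE
print(semi_join_cost)  # 26500
```

At 26,500 bytes, the semi-join strategy transfers well under half the 61,500 bytes of the cheapest full-table strategy, which is the point of reducing relations before shipment.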