DDBMS - Unit 1, Part 2

Distributed data processing involves multiple computers handling data across various locations, allowing for local processing needs and centralized databases. Advantages include increased availability and user control, while disadvantages involve complexities in management and potential for component incompatibility. Key applications span big data analytics, cloud computing, and e-commerce, highlighting the importance of distributed database systems in modern technology.

#What is distributed data processing? Explain advantages and disadvantages.

• Distributed data processing allows multiple computers to work together across geographically separate sites, where local computers handle local processing needs.
• Distributed processing is a database's logical processing shared among two or more physically independent sites that are connected through a network.
• Data processing is organized around several information processing units distributed throughout the company. End users control their local information processing units. DDP uses a centralized database.

Advantages:

1. Availability
2. Resource sharing
3. Incremental growth
4. Increased user involvement and control
5. End-user productivity
6. Distance and location independence
7. Privacy and security

Disadvantages:

1. More difficult testing and failure diagnosis
2. More components and dependence on communication links mean more points of failure
3. Incompatibility of components
4. Incompatibility of data
5. More complex management and control
6. Difficulty in control of corporate information resources
7. Suboptimal procurement
8. Duplication of effort
#What is data independence?

• The ability to modify a scheme definition at one level without affecting a scheme definition at a higher level is called data independence. It is a fundamental form of transparency that we look for within a DBMS. Data definition occurs at two levels: at one level the logical structure of the data is specified, and at the other its physical structure.
• There are two kinds: physical data independence and logical data independence.
1. Physical data independence: The ability to modify the physical scheme without causing application programs to be rewritten. Modifications at this level are usually made to improve performance.
2. Logical data independence: The ability to modify the conceptual scheme without causing application programs to be rewritten. Usually done when the logical structure of the database is altered.
Logical data independence is harder to achieve as the application programs are usually
heavily dependent on the logical structure of the data.

# COMPONENTS OF DISTRIBUTED DBMS.

- Computer workstations or remote devices (sites or nodes) that form the network system.
The distributed database system must be independent of the computer system hardware.

- Network hardware and software components that reside in each workstation or device.
The network components allow all sites to interact and exchange data. Because the
components (computers, operating systems, network hardware, and so on) are likely to be
supplied by different vendors, it is best to ensure that distributed database functions can
be run on multiple platforms.

- Communications media that carry the data from one node to another. The DDBMS must
be communications media-independent; that is, it must be able to support several types of
communications media.

- The Transaction Processor (TP), which is the software component found in each
computer or device that requests data. The transaction processor receives and processes
the application's data requests (remote and local). The TP is also known as the Application
Processor (AP) or the Transaction Manager (TM).

- The Data Processor (DP), which is the software component residing on each computer or device that stores and retrieves data located at the site. The DP is also known as the Data Manager (DM). A data processor may even be a centralized DBMS.

# Applications of Distributed Data Management Systems (DDMS)

Distributed Data Management Systems are used wherever large-scale, reliable, and
efficient data storage and processing are required across multiple machines. Here are the
key application areas:

1. Big Data Analytics

• Use Case: Processing massive datasets (e.g., logs, clickstreams, sensor data)
• Examples: Hadoop, Apache Spark
• Industries: E-commerce, telecom, social media, finance

2. Cloud Computing

• Use Case: Storage and access of distributed data over the cloud
• Examples: Amazon S3, Google Bigtable, Microsoft Azure Cosmos DB
• Industries: All modern tech-enabled businesses

3. Social Media Platforms

• Use Case: Managing user profiles, messages, media files, real-time feeds
• Examples: Facebook (Cassandra), Twitter (Manhattan), Instagram (PostgreSQL +
caching layers)

4. E-commerce Systems

• Use Case: Product catalogs, user sessions, inventory management, transaction logs
• Examples: Amazon, Flipkart, eBay using distributed databases and caching systems
5. Financial Services

• Use Case: Fraud detection, transaction processing, risk analysis in real time
• Examples: Banks using distributed systems for real-time data streams and decision-making

6. Healthcare Systems

• Use Case: Storing and managing patient records, research data, imaging files
across institutions
• Examples: Distributed health record systems, genomic data analysis platforms

7. IoT and Smart Devices

• Use Case: Managing data from millions of sensors and devices
• Examples: Smart city infrastructure, smart homes, industrial IoT platforms

8. Search Engines and Web Crawling

• Use Case: Storing and indexing web-scale content
• Examples: Google Bigtable, Elasticsearch clusters, Solr

9. Content Delivery Networks (CDNs)

• Use Case: Efficient data replication and retrieval across global users
• Examples: Akamai, Cloudflare, Netflix’s Open Connect

10. Scientific Research & High-Performance Computing

• Use Case: Simulations, experiments, and collaborative research across countries
• Examples: CERN and NASA using distributed data stores and compute clusters

#Advantages / Promises of DDBS

Distributed Database Systems deliver the following advantages:

• Higher reliability
• Improved performance
• Easier system expansion
• Transparency of distributed and replicated data
Higher reliability

• Replication of components
• No single points of failure
• e.g., a broken communication link or processing element does not bring down the
entire system
• Distributed transaction processing guarantees the consistency of the database and supports concurrency

Improved performance

• Proximity of data to its points of use
 Reduces remote access delays
 Requires some support for fragmentation and replication
• Parallelism in execution
 Inter-query parallelism
 Intra-query parallelism
• Update and read-only queries influence the design of DDBSs substantially
 If mostly read-only access is required, as much as possible of the data
should be replicated
 Writing becomes more complicated with replicated data

Easier system expansion

• Issue is database scaling
• Emergence of microprocessor and workstation technologies
 Network of workstations much cheaper than a single mainframe computer
• Data communication cost versus telecommunication cost
• Increasing database size
Transparency

• Refers to the separation of the higher-level semantics of the system from the lower-
level implementation issues
• A transparent system “hides” the implementation details from the users.
• A fully transparent DBMS provides high-level support for the development of
complex applications.

[Figure: (a) the user wants to see one database; (b) the programmer sees many databases]

Various forms of transparency can be distinguished for DDBMSs:

• Network transparency (also called distribution transparency)
 Location transparency
 Naming transparency
• Replication transparency
• Fragmentation transparency
• Transaction transparency
 Concurrency transparency
 Failure transparency
• Performance transparency

Network/Distribution transparency allows a user to perceive a DDBS as a single, logical entity.

• The user is protected from the operational details of the network (or even does not
know about the existence of the network)
• The user does not need to know the location of data items; a command used to perform a task is independent of the location of the data and of the site where the task is performed (location transparency)
• A unique name is provided for each object in the database (naming transparency)
 In the absence of this, users are required to embed the location name as part of an identifier
Different ways to ensure naming transparency:

• Solution 1: Create a central name server; however, this results in
 loss of some local autonomy
 central site may become a bottleneck
 low availability (if the central site fails, remaining sites cannot create new objects)
• Solution 2: Prefix object with identifier of site that created it
 e.g., branch created at site S1 might be named S1.BRANCH
 Also need to identify each fragment and its copies
 e.g., copy 2 of fragment 3 of Branch created at site S1 might be referred to as
S1.BRANCH.F3.C2
• An approach that resolves these problems uses aliases for each database object
 Thus, S1.BRANCH.F3.C2 might be known as local branch by user at site S1
 DDBMS has the task of mapping an alias to the appropriate database object (see the sketch below)
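
To make the alias approach concrete, here is a minimal sketch using Oracle-style database links and synonyms; the link name s1_link, the credentials, and the service name are hypothetical, and other DDBMSs expose similar aliasing mechanisms:

```sql
-- Hypothetical setup: site S1 holds the BRANCH table, reached via a link.
CREATE DATABASE LINK s1_link
  CONNECT TO app_user IDENTIFIED BY app_password
  USING 's1_service';

-- The alias (synonym) hides the remote location from users and programs.
CREATE SYNONYM local_branch FOR branch@s1_link;

-- Users query the alias; the DDBMS maps it to the remote object.
SELECT * FROM local_branch;
```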

Replication transparency ensures that the user is not involved in the management of
copies of some data
• The user should not even be aware of the existence of replicas, but rather should work as if there were a single copy of the data
• Replication of data is needed for various reasons
 e.g., increased efficiency for read-only data access

Fragmentation transparency ensures that the user is not aware of and is not involved in
the fragmentation of the data
• The user is not involved in finding query processing strategies over fragments or
formulating queries over fragments
 The evaluation of a query that is specified over an entire relation but now has
to be performed on top of the fragments requires an appropriate query
evaluation strategy
• Fragmentation is commonly done for reasons of performance, availability, and
reliability
• Two fragmentation alternatives (see the sketch after this list)
 Horizontal fragmentation: divide a relation into subsets of tuples
 Vertical fragmentation: divide a relation by columns
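
A minimal sketch of both alternatives in SQL (PostgreSQL-flavored CREATE TABLE AS), assuming a hypothetical BRANCH(branch_id, city, assets) relation; a real DDBMS would allocate the fragments to different sites and hide the reconstruction from users:

```sql
-- Horizontal fragmentation: each fragment holds a subset of the tuples.
CREATE TABLE branch_ktm  AS SELECT * FROM branch WHERE city = 'Kathmandu';
CREATE TABLE branch_rest AS SELECT * FROM branch WHERE city <> 'Kathmandu';

-- Vertical fragmentation: each fragment holds a subset of the columns,
-- repeating the key so the relation can be rebuilt by a join.
CREATE TABLE branch_location AS SELECT branch_id, city   FROM branch;
CREATE TABLE branch_finance  AS SELECT branch_id, assets FROM branch;

-- Fragmentation transparency: a view reconstructs the original relation,
-- so user queries never need to mention the fragments.
CREATE VIEW branch_all AS
  SELECT * FROM branch_ktm
  UNION ALL
  SELECT * FROM branch_rest;
```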
Transaction transparency ensures that all distributed transactions maintain integrity and
consistency of the DDB and support concurrency
• Each distributed transaction is divided into a number of sub-transactions (a sub-
transaction for each site that has relevant data) that concurrently access data at
different locations
• DDBMS must ensure the indivisibility of both the global transaction and each of the
sub-transactions
• Can be further divided into
 Concurrency transparency
 Failure transparency

Concurrency transparency guarantees that transactions execute independently and are logically consistent, i.e., executing a set of transactions in parallel gives the same result as if the transactions were executed in some arbitrary serial order.

• Same fundamental principles as for a centralized DBMS, but more complicated to realize:
 DDBMS must ensure that global and local transactions do not interfere with
each other
 DDBMS must ensure consistency of all sub-transactions of global
transaction
• Replication makes concurrency even more complicated
 If a copy of a replicated data item is updated, update must be propagated to
all copies
 Option 1: Propagate changes as part of original transaction, making it an
atomic operation; however, if one site holding a copy is not reachable, then
the transaction is delayed until the site is reachable.
 Option 2: Limit update propagation to only those sites currently available;
remaining sites are updated when they become available again.
 Option 3: Allow updates to copies to happen asynchronously, sometime
after the original update; delay in regaining consistency may range from a
few seconds to several hours

Failure transparency: DDBMS must ensure atomicity and durability of the global
transaction, i.e., the sub-transactions of the global transaction either all commit or all
abort.
• Thus, DDBMS must synchronize global transaction to ensure that all sub-
transactions have completed successfully before recording a final COMMIT for the
global transaction
• The solution should be robust in the presence of site and network failures; two-phase commit, sketched below, is the classical approach
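
A minimal sketch of two-phase commit (2PC) using PostgreSQL's 2PC commands; the global transaction identifier 'tx_global_42' and the account table are hypothetical, and in a real DDBMS a coordinator drives these steps at every participating site:

```sql
-- Phase 1 (voting): each site performs its sub-transaction, then prepares.
BEGIN;
UPDATE account SET balance = balance - 100 WHERE account_id = 7;
PREPARE TRANSACTION 'tx_global_42';  -- this site votes "ready to commit"

-- Phase 2 (decision): once every site has voted "ready", the coordinator
-- tells each site to commit; a single failure vote triggers the abort path.
COMMIT PREPARED 'tx_global_42';
-- ROLLBACK PREPARED 'tx_global_42';  -- abort path, if any site failed
```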

• Performance transparency: DDBMS must perform as if it were a centralized DBMS
 DDBMS should not suffer any performance degradation due to the distributed architecture
 DDBMS should determine the most cost-effective strategy to execute a request
• Distributed Query Processor (DQP) maps data request into an ordered sequence of
operations on local databases
• DQP must consider fragmentation, replication, and allocation schemas
• DQP has to decide:
 which fragment to access
 which copy of a fragment to use
 which location to use
• DQP produces execution strategy optimized with respect to some cost function
• Typically, costs associated with a distributed request include: I/O cost, CPU cost,
and communication cost
# Complicating Factors in Distributed Database Management Systems (DDBMS)

Designing and operating a Distributed Database Management System (DDBMS) is significantly more complex than running a centralized system, due to several technical and operational challenges.

1. Complexity

Distributed systems are more complex than centralized ones because they involve
multiple sites, networks, and synchronization. Managing data consistency,
communication between nodes, and fault tolerance adds to the system's overall difficulty.

2. Cost

Implementing and maintaining a DDBMS is expensive. It requires more hardware, networking infrastructure, software licenses, and skilled professionals to manage the system efficiently.

3. Security

Since data is stored and transmitted across different locations and networks, it is more
vulnerable to unauthorized access, attacks, and data breaches. Ensuring secure
communication and access control across nodes is more difficult.

4. Integrity Control More Difficult

Maintaining data integrity—ensuring that the data is accurate and consistent—is harder in
a distributed environment. This is because updates and transactions may happen
simultaneously across different locations, increasing the chance of conflicts or
inconsistencies.

5. Lack of Standards

There is no universal standard for distributed databases, which leads to incompatibility between different systems and tools. This makes integration, development, and migration more difficult.
6. Lack of Experience

Fewer professionals have experience with distributed systems compared to centralized databases. This lack of expertise can lead to poor system design, inefficient operations, and difficulties in troubleshooting.

7. Database Design More Complex

Designing a distributed database involves additional considerations like data fragmentation, replication, and placement. Decisions must be made to balance performance, consistency, and availability, making the design process much more complicated.

#Relational Database Concepts

Relational databases are based on the relational model, introduced by E.F. Codd (1970),
which organizes data into structured tables (relations) with defined relationships. Below is
a comprehensive breakdown of key concepts:

1. Core Components of the Relational Model

(a) Relations (Tables)

• A relation is a two-dimensional table with rows (tuples) and columns (attributes).
• Each table represents an entity (e.g., Students, Courses).
• Properties:
 No duplicate rows (each row is unique).
 Column order doesn’t matter.
 Row order doesn’t matter.

(b) Attributes (Columns)

• Define the structure of a relation (e.g., StudentID, Name, Age).
• Each attribute has a domain (data type), e.g., INTEGER, VARCHAR, DATE.
• Key Attributes:
 Primary Key (PK): Uniquely identifies a row (e.g., StudentID).
 Foreign Key (FK): References a PK in another table (e.g., CourseID in
Enrollments).

(c) Tuples (Rows)

• A single record in a table (e.g., (101, "Alice", 22)).
• Must satisfy entity integrity (PK cannot be NULL).

2. Relational Constraints

(a) Entity Integrity

• Primary Key must be unique and not NULL.
• Ensures each row is identifiable.

(b) Referential Integrity

• Foreign Key must match an existing Primary Key (or be NULL).
• Prevents "orphaned" records (e.g., no Enrollment for a nonexistent Student).

(c) Domain Constraints

• Ensures data types are respected (e.g., Age must be an integer).

(d) User-Defined Constraints

• Custom rules (e.g., Salary > 0).
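
As a hedged illustration, the sketch below expresses all four constraint kinds in standard SQL; the table and column names are only illustrative:

```sql
CREATE TABLE Departments (
    Dept        VARCHAR(30) PRIMARY KEY,            -- entity integrity
    DeptManager VARCHAR(50)
);

CREATE TABLE Employees (
    EmpID  INT PRIMARY KEY,                         -- entity integrity
    Name   VARCHAR(50) NOT NULL,                    -- domain constraint
    Salary DECIMAL(10,2) CHECK (Salary > 0),        -- user-defined constraint
    Dept   VARCHAR(30) REFERENCES Departments(Dept) -- referential integrity
);
```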

3. Relational Algebra (Operations)

A formal system for querying relational databases. Key operations:

| Operation | Symbol | Description | Example (SQL-like) |
|---|---|---|---|
| Selection | σ | Filters rows (WHERE clause) | σ(Age > 20)(Students) |
| Projection | π | Selects columns (SELECT clause) | π(Name, Age)(Students) |
| Join | ⋈ | Combines tables (JOIN) | Students ⋈ Enrollments |
| Union | ∪ | Merges rows from two tables (UNION) | π(Name)(Students) ∪ π(Name)(Teachers) |
| Difference | − | Rows in A but not B (EXCEPT) | π(CourseID)(Math) − π(CourseID)(Art) |
| Cartesian Product | × | All possible row combinations | Students × Courses |

4. Normalization (Eliminating Redundancy)

Normalization minimizes redundancy by decomposing tables into smaller, well-structured relations.

(a) 1NF (First Normal Form)

• Atomic values (no repeating groups or arrays).
• Example:
Bad: Student (ID, Name, Courses: [Math, Physics])
Good: Student (ID, Name) + Enrollment (StudentID, Course)

(b) 2NF (Second Normal Form)

• 1NF + no partial dependencies (non-key attributes depend on the full PK).
• Example (sketched in SQL below):
Bad: Orders (OrderID, ProductID, ProductName, Quantity)
Good: Split into Orders (OrderID, ProductID, Quantity) and Products (ProductID, ProductName)
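
A short sketch of this 2NF split in plain SQL (names follow the example above); the composite key of Orders makes the "full PK" dependency explicit:

```sql
-- ProductName depended on ProductID alone (part of the key), so it moves
-- to its own table; Orders keeps only attributes that depend on the full key.
CREATE TABLE Products (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(50)
);

CREATE TABLE Orders (
    OrderID   INT,
    ProductID INT REFERENCES Products(ProductID),
    Quantity  INT,
    PRIMARY KEY (OrderID, ProductID)   -- the full key
);
```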

(c) 3NF (Third Normal Form)

• 2NF + no transitive dependencies (non-key attributes depend only on the PK).
• Example:
Bad: Employees (EmpID, Dept, DeptManager)
Good: Employees (EmpID, Dept) + Departments (Dept, DeptManager)

(d) BCNF (Boyce-Codd Normal Form)

• Stricter than 3NF; ensures every determinant is a candidate key.

(e) 4NF & 5NF

• Deal with multi-valued dependencies and join anomalies (rarely used in practice).

5. SQL (Structured Query Language)

SQL is the standard language for interacting with relational databases.

(a) Key SQL Commands

| Command | Purpose | Example |
|---|---|---|
| SELECT | Retrieve data | SELECT * FROM Students WHERE Age > 20 |
| INSERT | Add new records | INSERT INTO Students VALUES (101, 'Alice') |
| UPDATE | Modify existing data | UPDATE Students SET Age = 23 WHERE ID = 101 |
| DELETE | Remove records | DELETE FROM Students WHERE ID = 101 |
| CREATE TABLE | Define a new table | CREATE TABLE Students (ID INT PRIMARY KEY, Name VARCHAR(50)) |
| ALTER TABLE | Modify table structure | ALTER TABLE Students ADD COLUMN Email VARCHAR(100) |
| DROP TABLE | Delete a table | DROP TABLE Students |
(b) Joins

| Join Type | Description | Example |
|---|---|---|
| INNER JOIN | Returns matching rows | SELECT * FROM Students INNER JOIN Enrollments ON Students.ID = Enrollments.StudentID |
| LEFT JOIN | All rows from left + matching right | SELECT * FROM Students LEFT JOIN Enrollments ON Students.ID = Enrollments.StudentID |
| RIGHT JOIN | All rows from right + matching left | SELECT * FROM Students RIGHT JOIN Enrollments ON Students.ID = Enrollments.StudentID |
| FULL JOIN | All rows from both tables | SELECT * FROM Students FULL JOIN Enrollments ON Students.ID = Enrollments.StudentID |
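
A self-contained sketch that makes the LEFT JOIN semantics concrete (tiny illustrative tables): Bob has no enrollment, so a LEFT JOIN returns him with a NULL course, while an INNER JOIN would drop him.

```sql
CREATE TABLE Students (ID INT PRIMARY KEY, Name VARCHAR(50));
CREATE TABLE Enrollments (StudentID INT, Course VARCHAR(30));

INSERT INTO Students VALUES (101, 'Alice');
INSERT INTO Students VALUES (102, 'Bob');
INSERT INTO Enrollments VALUES (101, 'Math');

-- LEFT JOIN keeps every student; Bob appears with Course = NULL.
SELECT s.Name, e.Course
FROM Students s LEFT JOIN Enrollments e ON s.ID = e.StudentID;
```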

6. ACID Properties (Transaction Management)

Relational databases ensure reliable transactions via ACID:

• Atomicity: Transactions are all-or-nothing (if one part fails, the whole transaction
rolls back).
• Consistency: Transactions bring the database from one valid state to another.
• Isolation: Concurrent transactions don’t interfere (via locking or MVCC).
• Durability: Once committed, changes persist even after a crash.
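
A minimal sketch of atomicity and durability (PostgreSQL-flavored syntax; the Accounts table and amounts are illustrative): both updates of the transfer commit together, or neither does.

```sql
-- Transfer 100 from account 1 to account 2 as one atomic unit.
BEGIN;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccID = 1;
UPDATE Accounts SET Balance = Balance + 100 WHERE AccID = 2;
COMMIT;      -- durability: once COMMIT returns, the transfer survives a crash
-- ROLLBACK; -- had anything failed before COMMIT, the whole unit is undone
```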

7. Indexes & Optimization

• Indexes (e.g., B-trees) speed up searches (but slow down inserts/updates).
• Query Optimization: The DBMS picks the best execution plan (e.g., EXPLAIN in SQL); a short sketch follows.
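
A short sketch (PostgreSQL-flavored; the index name is illustrative) of adding an index and asking the optimizer for its plan:

```sql
-- A B-tree index on Age lets the query below avoid a full table scan.
CREATE INDEX idx_students_age ON Students (Age);

-- EXPLAIN shows the chosen execution plan without running the query;
-- with enough rows, the plan switches from a sequential scan to an
-- index scan on Age.
EXPLAIN SELECT * FROM Students WHERE Age > 20;
```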
8. Distributed Relational Databases

• Extends the relational model across multiple machines (e.g., sharding, replication); a sharding sketch follows.
• Challenges: Consistency (CAP Theorem), network latency, distributed joins.
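
As one hedged illustration of sharding, PostgreSQL's declarative hash partitioning splits a logical table into pieces that could be placed on different nodes; the table and column names are hypothetical, and production systems layer replication and query routing on top:

```sql
-- One logical table, physically split by hashing the customer key.
CREATE TABLE Orders (
    OrderID    INT,
    CustomerID INT,
    Amount     DECIMAL(10,2)
) PARTITION BY HASH (CustomerID);

CREATE TABLE orders_shard_0 PARTITION OF Orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE orders_shard_1 PARTITION OF Orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- Queries target the logical table; the planner routes to the right shard.
INSERT INTO Orders VALUES (1, 42, 99.50);
SELECT * FROM Orders WHERE CustomerID = 42;
```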
