Parallel Database
Introduction
- Parallel machines are becoming quite common and affordable.
  - Prices of microprocessors, memory and disks have dropped sharply.
- Databases are growing increasingly large.
  - Large volumes of transaction data are collected and stored for
    later analysis.
  - Multimedia objects like images are increasingly stored in
    databases.
- Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data
  - processing time-consuming decision-support queries
  - providing high throughput for transaction processing
Disadvantages
1. More start-up costs
2. Interference Problem
3. Skew problem
Q2. What are Speedup and Scaleup?
SpeedUp:
1. Speedup: a fixed-size problem is run on a larger system; more resources
mean proportionally less time for a given amount of data.
ScaleUp:
1. Scaleup: if resources are increased in proportion to the increase in data
size, the elapsed time stays constant.
2. Increase the size of both the problem and the system: an N-times larger
system is used to perform an N-times larger job. Measured by:
scaleup = (small-system, small-problem elapsed time) /
          (big-system, big-problem elapsed time)
3. Scaleup is linear if this ratio equals 1.
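The two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the timings are made up, and the speedup ratio (small-system time divided by large-system time for the same problem) is the usual textbook form, which the notes above do not spell out.

```python
def speedup(small_system_time: float, large_system_time: float) -> float:
    """Same problem on both systems; linear if the result equals N."""
    return small_system_time / large_system_time


def scaleup(small_time: float, big_time: float) -> float:
    """Small problem on the small system vs. N-times larger problem on the
    N-times larger system; linear if the result equals 1."""
    return small_time / big_time


# Hypothetical elapsed times in seconds for N = 4 (not measurements).
print(speedup(100.0, 25.0))   # 4.0  -> linear speedup
print(speedup(100.0, 40.0))   # 2.5  -> sublinear speedup
print(scaleup(100.0, 100.0))  # 1.0  -> linear scaleup
print(scaleup(100.0, 125.0))  # 0.8  -> sublinear scaleup
```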
Factors Limiting Speedup and Scaleup:
Speedup and scaleup are often sublinear due to:
1. Startup costs:
Cost of starting up multiple processes may dominate computation
time, if the degree of parallelism is high.
2. Interference:
Processes accessing shared resources (e.g., system bus, disks, or
locks) compete with each other, and so spend time waiting on one another.
3. Skew:
Overall execution time is determined by the slowest of the tasks
executing in parallel.
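The effect of skew and startup cost on speedup can be illustrated with a toy model. The model is an assumption made for this sketch, not a formula from the notes: elapsed time is taken as a fixed startup cost per process plus the running time of the slowest task.

```python
def parallel_elapsed(task_times, startup_per_task=0.0):
    """Toy model (assumed): startup overhead for every task, plus the
    running time of the slowest task."""
    return startup_per_task * len(task_times) + max(task_times)


serial_time = 100.0                   # hypothetical single-processor time
balanced = [25.0, 25.0, 25.0, 25.0]   # work split evenly over 4 processors
skewed = [10.0, 10.0, 10.0, 70.0]     # same total work, badly split

print(serial_time / parallel_elapsed(balanced))                  # 4.0
print(round(serial_time / parallel_elapsed(skewed), 2))          # 1.43
print(round(serial_time / parallel_elapsed(balanced, 1.0), 2))   # 3.45
```

Both splits do the same total work, yet the skewed one gets well under half the speedup because one task takes 70 of the 100 seconds.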
1. Shared memory :
a. Processors share a common main memory.
2. Shared disk :
a. Processors have direct access to all disks.
b. Each CPU has a private memory and direct access to all disks
through an interconnection network.
3. Shared nothing :
a. Processors share neither a common memory nor a common disk.
b. Each CPU has local main memory and disk space, but no two
CPUs can access the same storage area; all communication
between CPUs is through a network connection.
1. Shared memory architecture:
1. Multiple CPUs are attached to an interconnection network and can
access a common region of main memory
2. Extremely efficient communication between processors — data in shared
memory can be accessed by any processor without having to move it
using software.
3. Disadvantage – architecture is not scalable beyond 32 or 64 processors
since the bus or the interconnection network becomes a bottleneck
4. Widely used for lower degrees of parallelism (4 to 8).
2. Shared disk architecture:
1. Each CPU has a private memory and direct access to all disks through
an interconnection network.
2. Advantages:
a. The memory bus is not a bottleneck
b. Architecture provides a degree of fault-tolerance — if a
processor fails, the other processors can take over its tasks since
the database is resident on disks that are accessible from all
processors.
3. Examples: IBM Sysplex and DEC clusters
4. Downside:
Bottleneck now occurs at interconnection to the disk subsystem.
Shared-disk systems can scale to a somewhat larger number of
processors, but communication between processors is slower.
3. Shared nothing architecture
1. Node consists of a processor, memory, and one or more disks.
Processors at one node communicate with another processor at another
node using an interconnection network. A node functions as the server for
the data on the disk or disks the node owns.
Examples: Teradata, Tandem, Oracle-nCUBE
2. Data accessed from local disks (and local memory accesses) do not pass
through interconnection network, thereby minimizing the interference
of resource sharing.
3. Shared-nothing multiprocessors can be scaled up to thousands of
processors without interference.
4. Main drawback:
cost of communication and non-local disk access; sending data
involves software interaction at both ends.
Hierarchical Architecture
Combines characteristics of shared-memory, shared-disk, and
shared-nothing architectures.
PARALLEL QUERY EVALUATION
Introduction:
COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniques support the following types of data
access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries.
E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a
specified range – range queries.
E.g., 10 ≤ r.A < 25.
1. Input/ Output parallelism
Parallel I/O refers to the process of writing to, or reading from, two
or more I/O devices simultaneously.
a. Data partitioning
Reduce the time required to retrieve relations from disk by partitioning
the relations on multiple disks.
1. Horizontal partitioning :
tuples of a relation are divided among many disks such that each
tuple resides on one disk. (number of disks = n)
2. Vertical partitioning :
involves creating tables with fewer columns and using additional
tables to store the remaining columns.
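A minimal sketch of vertical partitioning, using rows represented as Python dictionaries. The function name, the sample columns, and the choice to repeat the key column in both narrower tables (so the original rows can be rebuilt by a join) are assumptions made for illustration.

```python
def vertical_partition(rows, key, kept_columns):
    """Split each row into two narrower rows that both carry the key column."""
    main, rest = [], []
    for row in rows:
        main.append({c: row[c] for c in row if c == key or c in kept_columns})
        rest.append({c: row[c] for c in row if c == key or c not in kept_columns})
    return main, rest


# Hypothetical employee rows.
rows = [
    {"emp_no": 1, "name": "A", "photo": "a.png"},
    {"emp_no": 2, "name": "B", "photo": "b.png"},
]
main, rest = vertical_partition(rows, "emp_no", {"name"})
print(main)  # [{'emp_no': 1, 'name': 'A'}, {'emp_no': 2, 'name': 'B'}]
print(rest)  # [{'emp_no': 1, 'photo': 'a.png'}, {'emp_no': 2, 'photo': 'b.png'}]
```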
Horizontal partitioning
a. Round-robin partitioning
b. Hash partitioning
c. Range partitioning
d. Schema partitioning
a. Round-robin partitioning
Sends the i-th tuple inserted in the relation to disk (i mod n). It
ensures an even distribution of tuples across disks.
Advantages
1. Best suited for sequential scan of entire relation on each query.
2. All disks have almost an equal number of tuples; retrieval work
is thus well balanced between disks.
Disadvantages
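1. Not well suited for point queries: tuples are not placed by attribute
value, so every disk must be searched.
2. No clustering, so range queries are difficult to answer: matching tuples
may be scattered across all disks.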
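Round-robin placement can be sketched as follows. The function name and the sample employee numbers are illustrative assumptions; tuples are numbered from 0 in insertion order.

```python
def round_robin_partition(tuples, n_disks):
    """Send the i-th inserted tuple to disk (i mod n_disks)."""
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)
    return disks


employees = list(range(1, 11))            # hypothetical employee numbers 1..10
disks = round_robin_partition(employees, 3)
print(disks)                              # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print([len(d) for d in disks])            # [4, 3, 3]
```

The disk sizes differ by at most one tuple, which is why a full scan is well balanced. Nothing about a tuple's value says which disk holds it, so a point or range query has to look at every disk.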
c. Range partitioning
1. The administrator specifies that attribute values within a range are to be
placed on a certain disk.
2. E.g., range partitioning with three disks numbered 0, 1, 2 might place
tuples for employee numbers up to 1000 on disk 0, tuples for
employee numbers 1001-1500 on disk 1, and tuples for employee
numbers 1501-2000 on disk 2.
Advantages
1. Offers good performance for range based queries and exact match
queries involving partitioning attributes.
2. Search narrows to exactly those disks that might have any tuples of
interest.
Disadvantages:
It can cause skew in some cases.
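The employee-number example above can be sketched with a sorted list of range boundaries. The helper names are hypothetical; `upper_bounds[i]` is taken to be the largest value stored on disk i, and the last disk takes everything above the final boundary.

```python
from bisect import bisect_left

UPPER_BOUNDS = [1000, 1500]   # disk 0: <= 1000, disk 1: 1001-1500, disk 2: > 1500


def disk_for(value, upper_bounds=UPPER_BOUNDS):
    """Return the index of the disk whose range contains value."""
    return bisect_left(upper_bounds, value)


def disks_for_range(low, high, upper_bounds=UPPER_BOUNDS):
    """Return only the disks that can hold values in [low, high]."""
    return list(range(disk_for(low, upper_bounds), disk_for(high, upper_bounds) + 1))


print(disk_for(1000), disk_for(1001), disk_for(1501))   # 0 1 2
print(disks_for_range(1200, 1400))                      # [1]
print(disks_for_range(900, 1100))                       # [0, 1]

# Skew: with employee numbers 1..2000, disk 0 holds twice as many tuples.
counts = [0, 0, 0]
for emp_no in range(1, 2001):
    counts[disk_for(emp_no)] += 1
print(counts)                                           # [1000, 500, 500]
```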
d. Schema partitioning
Different relations within a database are placed on different disks.
Disadvantage:
More prone to data skewing.
Continuation of Q6
2. Inter-query parallelism
Example:
- A relation has been partitioned across multiple disks by range partitioning on
some attribute.
- The user wants to sort on the partitioning attribute.
- The sort operation can be implemented by sorting each partition in parallel,
then concatenating the sorted partitions to get the final sorted relation.
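The sort described in this example can be sketched in Python. This is only an illustration under stated assumptions: worker threads stand in for separate processors (CPython threads do not give true CPU parallelism), the function name and data are made up, and the partitions are assumed to be already ordered by range, so that every value in one partition is smaller than every value in the next.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def parallel_range_sort(partitions):
    """Sort each range partition independently, then concatenate in disk order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        sorted_parts = list(pool.map(sorted, partitions))   # one task per partition
    return list(chain.from_iterable(sorted_parts))          # no merge step needed


# Hypothetical employee numbers, range-partitioned on three disks.
partitions = [[700, 3, 512], [1499, 1001, 1250], [1999, 1501, 1750]]
print(parallel_range_sort(partitions))
# [3, 512, 700, 1001, 1250, 1499, 1501, 1750, 1999]
```

Concatenation is enough only because the relation is range-partitioned on the sort attribute; with round-robin placement the sorted partitions would have to be merged instead.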
2. Independent parallelism
Advantage
Disadvantage