Rdbms - Unit 5
Query processing includes the translation of high-level queries into low-level expressions that can
be used at the physical level of the file system, query optimization, and the actual execution of
the query to get the result.
Query processing is the activity of extracting data from the database. It involves several steps
for fetching the data. The steps involved are:
Step-1:
Parser: During the parse call, the database converts the query into relational algebra and
performs the following checks: syntax check, semantic check, and shared pool check.
Step-2:
Optimizer: During the optimization stage, the database must perform a hard parse at least
once for every unique DML statement, and it performs optimization during this parse. The
database never optimizes DDL unless it includes a DML component, such as a subquery,
that requires optimization.
Optimization is the process in which multiple execution plans for satisfying a query are
examined and the most efficient plan is selected for execution.
The database catalog stores the execution plans, and the optimizer then passes the
lowest-cost plan for execution.
Step-3:
Execution Engine: Finally, the execution engine runs the query and displays the required result.
Suppose a user executes a query. As we have learned, there are various methods of
extracting the data from the database.
Example:
In SQL, a user wants to fetch the records of the employees whose salary is greater than or equal
to 10000. For doing this, a query such as the following is issued (assuming an Employee table
with a salary column):
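SELECT * FROM Employee WHERE salary >= 10000;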
Thus, to make the system understand the user query, it needs to be translated into the form of
relational algebra. We can bring this query into the relational algebra form as:
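σ salary >= 10000 (Employee)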
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
DBMS - Transaction
A transaction can be defined as a group of tasks. A single task is the minimum processing unit
which cannot be divided further.
Let’s take an example of a simple transaction. Suppose a bank employee transfers Rs 500 from
A's account to B's account. This very simple and small transaction involves several low-level
tasks.
A’s Account
Open_Account(A)
Old_Balance = A.balance
New_Balance = Old_Balance - 500
A.balance = New_Balance
Close_Account(A)
B’s Account
Open_Account(B)
Old_Balance = B.balance
New_Balance = Old_Balance + 500
B.balance = New_Balance
Close_Account(B)
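As a minimal sketch, the same transfer can be wrapped in one database transaction so that both updates commit or roll back together (using Python's built-in sqlite3 module; the accounts table and the bank.db file are assumptions for illustration):

import sqlite3

conn = sqlite3.connect("bank.db")  # hypothetical database file
try:
    cur = conn.cursor()
    # Both updates belong to a single transaction (atomicity):
    # either both are applied, or neither is.
    cur.execute("UPDATE accounts SET balance = balance - 500 WHERE id = ?", ("A",))
    cur.execute("UPDATE accounts SET balance = balance + 500 WHERE id = ?", ("B",))
    conn.commit()    # the transfer becomes durable only at this point
except Exception:
    conn.rollback()  # undo the partial transfer on any failure
    raise
finally:
    conn.close()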
ACID Properties
A transaction is a very small unit of a program, and it may contain several low-level tasks. A
transaction in a database system must maintain Atomicity, Consistency, Isolation, and
Durability − commonly known as the ACID properties − in order to ensure accuracy, completeness,
and data integrity.
Atomicity − This property states that a transaction must be treated as an atomic unit,
that is, either all of its operations are executed or none. There must be no state in the
database where a transaction is left partially completed. The database state should be
defined either as it was before the execution of the transaction or as it is after the
execution/abortion/failure of the transaction.
Consistency − The database must remain in a consistent state after any transaction. No
transaction should have any adverse effect on the data residing in the database. If the
database was in a consistent state before the execution of a transaction, it must remain
consistent after the execution of the transaction as well.
Durability − The database should be durable enough to hold all its latest updates even if
the system fails or restarts. If a transaction updates a chunk of data in a database and
commits, then the database will hold the modified data. If a transaction commits but
the system fails before the data could be written onto the disk, then that data will be
updated once the system springs back into action.
Isolation − In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that all the transactions
will be carried out and executed as if each were the only transaction in the system. No
transaction will affect the existence of any other transaction.
Serializability
When multiple transactions are being executed by the operating system in a multiprogramming
environment, there are possibilities that instructions of one transaction are interleaved with
those of some other transaction.
To resolve this problem, we allow parallel execution of a transaction schedule, if its transactions
are either serializable or have some equivalence relation among them.
Equivalence Schedules
Result Equivalence
If two schedules produce the same result after execution, they are said to be result equivalent.
They may yield the same result for some set of values and different results for another set of
values. That is why this equivalence is not generally considered significant.
View Equivalence
Two schedules are said to be view equivalent if the transactions in both the schedules perform
similar actions in a similar manner.
For example −
If T reads the initial data in S1, then it also reads the initial data in S2.
If T reads the value written by J in S1, then it also reads the value written by J in S2.
If T performs the final write on the data value in S1, then it also performs the final write
on the data value in S2.
Conflict Equivalence
Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if −
Both the schedules contain the same set of transactions.
The order of every pair of conflicting operations is maintained in both the schedules.
Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.
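The standard test for conflict serializability builds a precedence graph with an edge Ti → Tj for every pair of conflicting operations in which Ti's operation comes first, and then checks the graph for a cycle. A minimal Python sketch (the (transaction, operation, item) representation of a schedule is an assumption for illustration):

def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) tuples in execution order, op is 'r' or 'w'."""
    # Build precedence edges from conflicting operation pairs.
    edges = set()
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            # Conflict: same item, different transactions, at least one write.
            if x == y and ti != tj and "w" in (op1, op2):
                edges.add((ti, tj))
    # The schedule is conflict serializable iff the graph is acyclic:
    # repeatedly remove nodes that no remaining node points to.
    nodes = {t for t, _, _ in schedule}
    while nodes:
        free = {n for n in nodes if not any(a in nodes and b == n for a, b in edges)}
        if not free:
            return False  # a cycle remains: not conflict serializable
        nodes -= free
    return True

# T1 and T2 conflict on X (T1 first) and on Y (T2 first) => cycle => False
print(conflict_serializable([("T1", "r", "X"), ("T2", "w", "Y"),
                             ("T1", "r", "Y"), ("T2", "w", "X")]))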
States of Transactions
A transaction in a database can be in one of the following states −
Active − In this state, the transaction is being executed. This is the initial state of every
transaction.
Partially Committed − When a transaction executes its final operation, it is said to be in
a partially committed state.
Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.
Aborted − If any of the checks fails and the transaction has reached a failed state, then
the recovery manager rolls back all its write operations on the database to bring the
database back to its original state where it was prior to the execution of the transaction.
Transactions in this state are called aborted. The database recovery module can select
one of the two operations after a transaction aborts −
o Re-start the transaction
o Kill the transaction
Committed − If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the database system.
Lock-based Protocols
Database systems equipped with lock-based protocols use a mechanism by which any
transaction cannot read or write data until it acquires an appropriate lock on it. Locks are of
two kinds −
Binary Locks − A lock on a data item can be in two states; it is either locked or unlocked.
Shared/exclusive − This type of locking mechanism differentiates the locks based on
their uses. If a lock is acquired on a data item to perform a write operation, it is an
exclusive lock. Allowing more than one transaction to write on the same data item
would lead the database into an inconsistent state. Read locks are shared because no
data value is being changed.
Simplistic Lock Protocol
Simplistic lock-based protocols allow transactions to obtain a lock on every object before a
'write' operation is performed. Transactions may unlock the data item after completing the
'write' operation.
Pre-claiming Lock Protocol
Pre-claiming protocols evaluate their operations and create a list of data items on which they
need locks. Before initiating an execution, the transaction requests the system for all the locks
it needs beforehand. If all the locks are granted, the transaction executes and releases all the
locks when all its operations are over. If all the locks are not granted, the transaction rolls back
and waits until all the locks are granted.
Two-Phase Locking (2PL)
This locking protocol divides the execution phase of a transaction into three parts. In the first
part, when the transaction starts executing, it seeks permission for the locks it requires. The
second part is where the transaction acquires all the locks. As soon as the transaction releases
its first lock, the third phase starts. In this phase, the transaction cannot demand any new locks;
it only releases the acquired locks.
Two-phase locking has two phases, one is growing, where all the locks are being acquired by
the transaction; and the second phase is shrinking, where the locks held by the transaction are
being released.
To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock and then
upgrade it to an exclusive lock.
Strict Two-Phase Locking (Strict-2PL)
The first phase of Strict-2PL is the same as 2PL. After acquiring all the locks in the first phase, the
transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a
lock after using it. Strict-2PL holds all the locks until the commit point and releases all the locks
at a time.
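A minimal sketch of this growing/shrinking discipline in Python (the TwoPhaseTransaction class is hypothetical, not a real DBMS API; strict 2PL simply defers every release to the commit point):

import threading

class TwoPhaseTransaction:
    def __init__(self):
        self.held = []          # locks currently held
        self.shrinking = False  # becomes True at the first release

    def acquire(self, lock):
        # Growing phase: acquiring after any release violates 2PL.
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after release")
        lock.acquire()
        self.held.append(lock)

    def release(self, lock):
        # Shrinking phase: no new locks may be acquired from now on.
        self.shrinking = True
        self.held.remove(lock)
        lock.release()

    def commit(self):
        # Strict 2PL: hold every lock until commit, then release all at once.
        for lock in self.held:
            lock.release()
        self.held.clear()

x_lock, y_lock = threading.Lock(), threading.Lock()
t = TwoPhaseTransaction()
t.acquire(x_lock)
t.acquire(y_lock)  # growing phase
t.commit()         # strict 2PL: both locks released at the commit point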
Timestamp-based Protocols
The most commonly used concurrency protocol is the timestamp-based protocol. This protocol
uses either system time or a logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions at
the time of execution, whereas timestamp-based protocols start working as soon as a
transaction is created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age
of the transaction. A transaction created at 0002 clock time would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004 is
two seconds younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the system
know when the last ‘read and write’ operation was performed on the data item.
The timestamp-ordering protocol works as follows −
If a transaction Ti issues a read(X) operation and TS(Ti) < W-timestamp(X), the operation
is rejected and Ti is rolled back; otherwise, the operation is executed and
R-timestamp(X) is updated.
If a transaction Ti issues a write(X) operation and TS(Ti) < R-timestamp(X) or
TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back; otherwise, the
operation is executed and W-timestamp(X) is updated.
Time-stamp ordering rules can be modified to make the schedule view serializable.
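A sketch of these checks in Python (the DataItem class and plain integer timestamps are assumptions for illustration):

class DataItem:
    def __init__(self):
        self.r_ts = 0  # R-timestamp(X): largest TS that has read X
        self.w_ts = 0  # W-timestamp(X): largest TS that has written X

def read(x, ts):
    """Return True if read(X) by a transaction with timestamp ts is allowed."""
    if ts < x.w_ts:          # X was already overwritten by a younger transaction
        return False         # reject: roll the reader back
    x.r_ts = max(x.r_ts, ts)
    return True

def write(x, ts):
    """Return True if write(X) by a transaction with timestamp ts is allowed."""
    if ts < x.r_ts or ts < x.w_ts:  # a younger transaction already read or wrote X
        return False                # reject: roll the writer back
    x.w_ts = ts
    return True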
Deadlock
In a multi-process system, a deadlock is an unwanted situation that arises in a shared resource
environment, where a process waits indefinitely for a resource held by another process.
For example, assume a set of transactions {T0, T1, T2, ..., Tn}. T0 needs a resource X to complete
its task. Resource X is held by T1, and T1 is waiting for a resource Y, which is held by T2. T2 is
waiting for resource Z, which is held by T0. Thus, all the processes wait for each other to release
resources. In this situation, none of the processes can finish their task. This situation is known
as a deadlock.
Deadlocks are not healthy for a system. In case a system is stuck in a deadlock, the transactions
involved in the deadlock are either rolled back or restarted.
Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the
operations that transactions are about to execute. The DBMS inspects the operations and
analyzes whether they can create a deadlock situation. If it finds that a deadlock situation
might occur, then that transaction is never allowed to be executed.
There are deadlock prevention schemes that use the timestamp-ordering mechanism of
transactions in order to predetermine a deadlock situation.
Wait-Die Scheme
In this scheme, if a transaction requests to lock a resource (data item), which is already held
with a conflicting lock by another transaction, then one of the two possibilities may occur −
If TS(Ti) < TS(Tj) − that is Ti, which is requesting a conflicting lock, is older than Tj − then Ti
is allowed to wait until the data-item is available.
If TS(Ti) > TS(Tj) − that is Ti is younger than Tj − then Ti dies. Ti is restarted later with a
random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme
In this scheme, if a transaction requests to lock a resource (data item), which is already held
with a conflicting lock by another transaction, one of the two possibilities may occur −
If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back − that is Ti wounds Tj. Tj is restarted
later with a random delay but with the same timestamp.
If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction requests an
item held by a younger one, the older transaction forces the younger one to abort and release
the item.
In both the cases, the transaction that enters the system at a later stage is aborted.
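Both schemes compare only the two timestamps; a compact Python sketch (function names are hypothetical):

def wait_die(ts_requester, ts_holder):
    # Older requester (smaller timestamp) waits; younger requester dies.
    return "wait" if ts_requester < ts_holder else "die: restart later, same timestamp"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (aborts) the younger holder; younger requester waits.
    return "wound: holder is rolled back" if ts_requester < ts_holder else "wait"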
Deadlock Avoidance
Wait-for Graph
This is a simple method available to track whether any deadlock situation may arise. For each
transaction entering the system, a node is created. When a transaction Ti requests a lock on an
item, say X, which is held by some other transaction Tj, a directed edge is created from Ti to Tj.
If Tj releases item X, the edge between them is dropped and Ti locks the data item.
The system maintains this wait-for graph for every transaction waiting for some data items held
by others. The system keeps checking if there's any cycle in the graph.
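A sketch of that cycle check, with the wait-for graph stored as an adjacency dictionary (this representation is an assumption for illustration):

def has_deadlock(wait_for):
    """wait_for maps each transaction to the set of transactions it waits for."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(t):
        color[t] = GRAY                      # t is on the current path
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:  # back edge: a cycle, i.e. a deadlock
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK                     # fully explored
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in wait_for)

# T0 waits for T1, T1 for T2, T2 for T0: a cycle, so a deadlock.
print(has_deadlock({"T0": {"T1"}, "T1": {"T2"}, "T2": {"T0"}}))  # True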
Here, the system can use one of the following two approaches −
First, do not allow any request for an item which is already locked by another
transaction. This is not always feasible and may cause starvation, where a transaction
waits indefinitely for a data item and can never acquire it.
The second option is to roll back one of the transactions. It is not always feasible to roll
back the younger transaction, as it may be more important than the older one. With the help
of some relative algorithm, a transaction is chosen to be aborted. This transaction is known
as the victim and the process is known as victim selection.
Loss of Volatile Storage
A volatile storage like RAM stores all the active logs, disk buffers, and related data. In addition,
it stores all the transactions that are being currently executed. What happens if such a volatile
storage crashes abruptly? It would obviously take away all the logs and active copies of the
database. It makes recovery almost impossible, as everything that is required to recover the
data is lost.
We can have checkpoints at multiple stages so as to save the contents of the database
periodically.
A state of active database in the volatile memory can be periodically dumped onto a
stable storage, which may also contain logs and active transactions and buffer blocks.
<dump> can be marked on the log file whenever the database contents are dumped from
volatile memory to stable storage.
Recovery
When the system recovers from a failure, it can restore the latest dump.
It can maintain a redo-list and an undo-list based on checkpoints.
It can recover the system by consulting undo-redo lists to restore the state of all
transactions up to the last checkpoint.
A catastrophic failure is one where a stable, secondary storage device gets corrupt. With the
storage device, all the valuable data that is stored inside is lost. We have two different
strategies to recover data from such a catastrophic failure −
Remote backup − Here a backup copy of the database is stored at a remote
location from where it can be restored in case of a catastrophe.
Alternatively, database backups can be taken on magnetic tapes and stored at a safer
place. This backup can later be transferred onto a freshly installed database to bring it to
the point of backup.
Large databases are too bulky to be frequently backed up. In such cases, we have techniques
by which we can restore a database just by looking at its logs. So, all that we need to do here
is take a backup of all the logs at frequent intervals of time. The database can be backed up
once a week, and the logs, being very small, can be backed up every day or as frequently as
possible.
Remote Backup
Remote backup provides a sense of security in case the primary location where the database is
located gets destroyed. Remote backup can be offline, or real-time (online). In case it is offline,
it is maintained manually.
Online backup systems are more real-time and lifesavers for database administrators and
investors. An online backup system is a mechanism where every bit of the real-time data is
backed up simultaneously at two distant places. One of them is directly connected to the
system and the other one is kept at a remote place as backup.
As soon as the primary database storage fails, the backup system senses the failure and
switches the user system to the remote storage. Sometimes this is so instant that the users
can’t even realize a failure.
Crash Recovery
A DBMS is a highly complex system with hundreds of transactions being executed every second.
The durability and robustness of a DBMS depend on its complex architecture and its
underlying hardware and system software. If it fails or crashes amid transactions, the system
is expected to follow some sort of algorithm or technique to recover the lost data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various categories, as
follows −
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it
can’t go any further. This is called transaction failure where only a few transactions or
processes are hurt.
Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.
System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system aborts
an active transaction.
System Crash
There are problems − external to the system − that may cause the system to stop abruptly and
cause the system to crash. For example, an interruption in the power supply may cause the
underlying hardware or software to fail.
Disk Failure
In the early days of technology evolution, it was a common problem that hard-disk drives or
storage drives failed frequently.
Disk failures include the formation of bad sectors, unreachability of the disk, a disk head
crash, or any other failure that destroys all or part of the disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided
into two categories −
Volatile storage − As the name suggests, volatile storage cannot survive system
crashes. Volatile storage devices are placed very close to the CPU; normally they are
embedded on the chipset itself. Main memory and cache memory are examples of
volatile storage. They are fast but can store only a small amount of information.
Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations, which
are atomic in nature. But according to ACID properties of DBMS, atomicity of transactions as a
whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled
back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and
later, the actual database is updated.
Log-based Recovery
A log is a sequence of records that keeps track of the actions performed by a transaction.
When a transaction Tn starts, the log records <Tn, Start>; when Tn modifies an item X from
value V1 to V2, the log records <Tn, X, V1, V2>; and when Tn finishes, the log records
<Tn, commit>. The database can be modified using two approaches −
Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
Immediate database modification − Each log follows an actual database modification.
That is, the database is modified immediately after every operation.
When more than one transaction are being executed in parallel, the logs are interleaved. At the
time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in a real environment may fill up all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and
stored permanently on a storage disk. The checkpoint declares a point before which the DBMS
was in a consistent state and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log, it
puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list are redone, and their logs are saved again.
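A sketch of building the two lists by reading the log backwards to the last checkpoint (log records here are plain tuples mirroring the <Tn, Start>/<Tn, Commit> formats above; the 'checkpoint' marker and tuple layout are assumptions for illustration):

def build_recovery_lists(log):
    """log: records oldest-first, e.g. ('start', 'T1'), ('commit', 'T1'), ('checkpoint',)."""
    redo, undo = set(), set()
    for record in reversed(log):     # read backwards from the end
        if record[0] == "checkpoint":
            break                    # stop at the last checkpoint
        kind, txn = record[0], record[1]
        if kind == "commit":
            redo.add(txn)            # <Tn, Commit> seen: redo Tn
        elif kind == "start" and txn not in redo:
            undo.add(txn)            # <Tn, Start> with no commit: undo Tn
    return redo, undo

log = [("checkpoint",), ("start", "T1"), ("commit", "T1"), ("start", "T2")]
print(build_recovery_lists(log))     # ({'T1'}, {'T2'})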
Keys in DBMS
KEYS in DBMS are an attribute or set of attributes which helps you to identify a row (tuple) in a
relation (table). They allow you to find the relation between two tables. Keys help you uniquely
identify a row in a table by a combination of one or more columns in that table. A database key
is also helpful for finding a unique record or row in a table.
For example, ID is used as a key in the Student table because it is unique for each student. In
the PERSON table, passport_number, license_number, SSN are keys since they are unique for
each person.
Types of keys:
1. Primary key
It is the key used to identify one and only one instance of an entity uniquely. An
entity can contain multiple keys, as we saw in the PERSON table. The key which is most
suitable from that list becomes the primary key.
In the EMPLOYEE table, ID can be the primary key since it is unique for each employee.
In the EMPLOYEE table, we can even select License_Number and Passport_Number as
primary keys since they are also unique.
For each entity, the primary key selection is based on requirements and the developer's judgment.
2. Candidate key
A candidate key is an attribute or set of attributes that can uniquely identify a tuple.
The keys that could serve as the primary key, apart from the one actually chosen, remain
candidate keys. The candidate keys are as strong as the primary key.
For example: In the EMPLOYEE table, id is best suited for the primary key. The rest of the
attributes, like SSN, Passport_Number, License_Number, etc., are considered candidate keys.
3. Super Key
Super key is an attribute set that can uniquely identify a tuple. A super key is a superset of a
candidate key.
For example: In the above EMPLOYEE table, for (EMPLOYEE_ID, EMPLOYEE_NAME), the names of
two employees can be the same, but their EMPLOYEE_ID can't be the same. Hence, this
combination can also be a key.
4. Foreign key
Foreign keys are columns of a table used to point to the primary key of another
table.
Every employee works in a specific department in a company, and employee and
department are two different entities. So we can't store the department's information
in the employee table. That's why we link these two tables through the primary key of
one table.
We add the primary key of the DEPARTMENT table, Department_Id, as a new attribute
in the EMPLOYEE table.
In the EMPLOYEE table, Department_Id is the foreign key, and both the tables are
related.
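A sketch of the EMPLOYEE/DEPARTMENT link as table definitions, executed here through Python's built-in sqlite3 module (column names and types are assumptions for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,   -- primary key of DEPARTMENT
    dept_name     TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id     INTEGER PRIMARY KEY, -- primary key of EMPLOYEE
    passport_number TEXT UNIQUE,         -- a candidate (alternate) key
    department_id   INTEGER,             -- foreign key pointing to DEPARTMENT
    FOREIGN KEY (department_id) REFERENCES department (department_id)
);
""")
conn.close()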
5. Alternate key
There may be one or more attributes or a combination of attributes that uniquely identify each
tuple in a relation. These attributes or combinations of the attributes are called the candidate
keys. One key is chosen as the primary key from these candidate keys, and the remaining
candidate key, if it exists, is termed the alternate key. In other words, the total number of the
alternate keys is the total number of candidate keys minus the primary key. The alternate key
may or may not exist. If there is only one candidate key in a relation, it does not have an
alternate key.
For example, employee relation has two attributes, Employee_Id and PAN_No, that act as
candidate keys. In this relation, Employee_Id is chosen as the primary key, so the other
candidate key, PAN_No, acts as the Alternate key.
6. Composite key
Whenever a primary key consists of more than one attribute, it is known as a composite key.
This key is also known as Concatenated Key.
For example, in employee relations, we assume that an employee may be assigned multiple
roles, and an employee may work on multiple projects simultaneously. So the primary key will
be composed of all three attributes, namely Emp_ID, Emp_role, and Proj_ID in combination. So
these attributes act as a composite key since the primary key comprises more than one
attribute.
7. Artificial key
Keys created using arbitrarily assigned data are known as artificial keys. These keys are
created when a primary key is large and complex and has no relationship with many other
relations. The data values of the artificial keys are usually numbered in serial order.
For example, the primary key, which is composed of Emp_ID, Emp_role, and Proj_ID, is large in
employee relations. So it would be better to add a new virtual attribute to identify each tuple in
the relation uniquely.
Following are the important differences between Primary Key and Candidate key.
1. Definition − A primary key is a unique and non-null key which identifies a record uniquely
in a table. A table can have only one primary key. A candidate key is also a unique key that
identifies a record uniquely in a table, but a table can have multiple candidate keys.
2. Null − A primary key column cannot have a null value. A candidate key column can have a
null value.
3. Objective − The primary key is the most important part of any relation or table. A candidate
key signifies which keys can be used as the primary key.
Following are the important differences between Super Key and Candidate key −
A super key is any set of attributes that uniquely identifies a tuple, whereas a candidate key is
a minimal super key: no proper subset of a candidate key is itself a super key. Every candidate
key is a super key, but not every super key is a candidate key.
Note − In many database systems, selection using the primary key creates a clustered index,
whereas selection using a unique key creates a non-clustered index.