DBMS
UNIT-1
OVERVIEW OF DATABASE SYSTEMS
(1) Overview of DBMS
Data:
➢ Data is a collection of raw facts and figures, such as numbers, words, measurements, observations or
just descriptions of things.
➢ Alone it tells you nothing.
➢ The real goal is to turn the data into information.
➢ By itself, data is not meaningful; it is in an unorganized format.
➢ Once the data is processed, organized, structured or presented in a given context, it can become useful.
EX: your name, age, height etc.
Information:
➢ Information is derived from data.
➢ Information is nothing but refined data that has been put into a meaningful and useful context.
➢ Information is the backbone of any organization. It consists of images, text, documents, voice etc.
➢ The 3 key attributes of information are:
▪ Accuracy (correctness)
▪ Timeliness
▪ Relevancy
Data Base:
➢ The word “data” comes from “datum”, which means “a single piece of information”.
➢ A database is an organised collection of data, so that it can be easily stored, managed and accessed
electronically from a computer system.
➢ It organises the data in the form of rows, columns, indexes etc., to make it easier
to find relevant information.
➢ The main purpose of a database is to handle large amounts of information by storing, retrieving and
managing the data.
➢ Many databases are available, such as
MySQL, Sybase, Oracle, SQL Server etc.
➢ In diagrams, a database is conventionally represented by a cylindrical structure.
DBMS:
➢ DBMS is software which is used to manage the database.
➢ It is a collection of programs that enable to store, modify, and extract information from the database.
➢ It is a piece of software that provides services for accessing a database, while maintaining all the
required features of data
➢ It provides security and protection to the database; in case of multiple users it also maintains data
consistency.
➢ It provides an interface to perform various operations like database creations, storing data in it,
updating data, creating tables in database etc.,
Advantages of DBMS:
• Reducing Data Redundancy:
o File-based data management systems contained multiple files that were stored in many
different locations in a system or even across multiple systems. Because of this, there were
sometimes multiple copies of the same file, which led to data redundancy. In a database, all
the data is stored in a single place, so redundancy is greatly reduced.
• Sharing of Data:
o In a database, the users of the database can share the data among themselves. There are
various levels of authorisation to access the data, and consequently the data can only be
shared based on the correct authorisation protocols being followed.
• Data Integrity:
o Data integrity means that the data is accurate and consistent in the database. Data Integrity is
very important as there are multiple databases in a DBMS. All of these databases contain data
that is visible to multiple users.
• Data Security:
o Data security is a vital concept in a database. Only authorised users should be allowed to
access the database, and their identity should be authenticated using a username and
password.
• Privacy:
o The privacy rule in a database means only the authorized users can access a database
according to its privacy constraints. There are levels of database access, and a user can only
view the data they are allowed to see.
• Backup and Recovery:
o A Database Management System automatically takes care of backup and recovery. The users
don't need to back up data periodically because this is taken care of by the DBMS. Moreover,
it also restores the database to its previous condition after a crash or system failure.
• Data Consistency:
o Data consistency is ensured in a database because there is no data redundancy. All data
appears consistently across the database, and the data is the same for all the users viewing the
database.
Disadvantages of DBMS:
1. High Cost:
The high cost of software and hardware is the main disadvantage of the database management system.
Database users require a high-speed processor and a large amount of memory to use the database on the
DBMS.
2. Complexity:
Database management system (DBMS) is so complex for non-technical users. So, it isn’t easy to
manage and maintain database systems.
3. Increased Staff Cost:
DBMS requires an educated and skilled staff for managing and maintaining the databases. So, we need
to spend a lot of money to get this level of trained and experienced staff.
4. Requirement of Technical Staff:
Non-technical people can't understand the complexity of the database. So, technical staff are
required for maintaining and handling the database management system.
5. Cost of Data Conversion:
It is one of the big disadvantages of the database management system because the cost of data
conversion is very high.
6. Performance:
Performance is another big disadvantage of database systems: for small firms and organizations, a
full database system can be comparatively slow.
Objectives of DBMS:
1. Eliminate redundant data.
2. Make access to the data easy for the user.
3. Provide for mass storage of relevant data.
4. Protect the data from physical harm and unauthorised access.
5. Allow for growth in the data base system.
6. Make the latest modifications to the data base available immediately.
7. Allow for multiple users to be active at one time.
8. Provide prompt response to user requests for data.
(2)Describing and storing data in DBMS:
➢ A data model is a collection of high-level data description constructs that hide many low-level storage
details
➢ A widely used semantic data model called the entity-relationship (ER) model allows us to pictorially
denote entities and the relationships among them
1. Data abstraction: (3 Layer/Level Architecture)
Data abstraction is one of the fundamental characteristics of any database, which helps in making data
more accurate and easy to use.
Data abstraction refers to the act of representing data without giving the details of how the data is stored
or maintained. There are different levels of abstraction.
(3)Queries in DBMS:
A query is a question or a request for information expressed in a formal manner. In computer science,
a query is essentially the same thing; the only difference is that the answer or retrieved information comes
from a database.
A database query is either an action query or a select query. A select query is one that
retrieves data from a database. An action query asks for additional operations on the data, such as insertion,
updating, deleting or other forms of data manipulation.
Query languages are used to make queries in a database, and Structured Query Language
(SQL) is the standard.
Note: SQL and MySQL are not the same: MySQL is database software that uses SQL as its query
language. Other database systems with their own SQL dialects include Oracle and NuoDB.
Although SQL is the most popular query language, there are many other types of databases and
languages.
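As a sketch of the difference between the two kinds of query, assuming a hypothetical Employees table
with name, salary and department columns:
-- A select query: retrieves data without changing it
SELECT name, salary FROM Employees WHERE department = 'Sales';
-- An action query: asks for an operation on the data (here, an update)
UPDATE Employees SET salary = salary * 1.10 WHERE department = 'Sales';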
Query by Example (QBE)
➢ In this method, the system displays a blank record and lets you identify the fields and values that
define the query.
➢ This is a method of query creation that allows the user to look for documents based on an
example in the form of a selected text string, or in the form of a document name, or even a list of
documents. Because the QBE system develops the actual query, QBE is easier to grasp than formal
query languages, while still enabling powerful searches.
Before we delve into the examples, below are the main benefits of using a query:
• Review data from multiple tables simultaneously.
• Filter records containing only certain fields and of certain criteria.
• Automate data management tasks and perform calculations .
(4)Transaction management:
In the context of transaction processing, the acronym ACID refers to the four key properties of a transaction:
Atomicity
All changes to data are performed as if they are a single operation. That is, all the changes are
performed, or none of them are.
For example, in an application that transfers funds from one account to another, the atomicity property
ensures that, if a debit is made successfully from one account, the corresponding credit is made to the other
account.
Consistency
Data is in a consistent state when a transaction starts and when it ends.
For example, in an application that transfers funds from one account to another, the consistency
property ensures that the total value of funds in both the accounts is the same at the start and end of each
transaction.
Isolation
The intermediate state of a transaction is invisible to other transactions. As a result, transactions that
run concurrently appear to be serialized.
For example, in an application that transfers funds from one account to another, the isolation property
ensures that another transaction sees the transferred funds in one account or the other, but not in both, nor in
neither.
Durability
After a transaction successfully completes, changes to data persist and are not undone, even in the
event of a system failure. For example, in an application that transfers funds from one account to another,
the durability property ensures that the changes made to each account will not be reversed.
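A minimal SQL sketch of the funds-transfer example, assuming a hypothetical Accounts table with
acc_no and balance columns (the exact transaction syntax varies slightly between DBMSs):
START TRANSACTION;
UPDATE Accounts SET balance = balance - 800 WHERE acc_no = 'X';
UPDATE Accounts SET balance = balance + 800 WHERE acc_no = 'Y';
COMMIT;
-- COMMIT makes both updates durable together; if anything fails before it,
-- ROLLBACK undoes both changes (atomicity), and other transactions never
-- see the intermediate state (isolation).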
(5)Structure of DBMS:
• Database Management System (DBMS) is software that allows access to data stored in a database and
provides an easy and effective method of
• Defining the information.
• Storing the information.
• Manipulating the information.
• Protecting the information from system crashes or data theft.
• Differentiating access permissions for different users.
• The database system is divided into three components: Query Processor, Storage Manager, and Disk
Storage. These are explained below.
• STRUCTURE OF DBMS
• DBMS (Database Management System) acts as an interface between the user and the database. The user
requests the DBMS to perform various operations (insert, delete, update and retrieval) on the database.
The components of DBMS perform these requested operations on the database and provide necessary
data to the users. The various components of DBMS are shown below: -
• 1. DDL Compiler - The Data Definition Language compiler processes schema definitions specified in the
DDL. It stores metadata information such as the names of the files, data items, storage details of each
file, mapping information and constraints etc.
• 2. DML Compiler and Query optimizer - The DML commands such as insert, update, delete, retrieve
from the application program are sent to the DML compiler for compilation into object code for database
access. The object code is then optimized in the best way to execute a query by the query optimizer and
then sent to the data manager.
• 3. Data Manager - The Data Manager is the central software component of the DBMS, also known
as the Database Control System.
Fig. 2.1 Structure of DBMS
(5)Data Models
Data Model: A data model gives us an idea of how the final system will look after its complete
implementation.
It defines the data elements and the relationships between the data elements. Data models are used to
show how data is stored, connected, accessed and updated in the database management system.
Here, we use a set of symbols and text to represent the information so that members of the
organisation can communicate and understand it.
Though there are many data models in use nowadays, the Relational model is the most
widely used model.
Apart from the Relational model, there are many other types of data models, which we will
study in detail below. Some of the Data Models in DBMS are:
1. Hierarchical Model
2. Network Model
3. Entity-Relationship Model
4. Relational Model
5. Object-Oriented Data Model
6. Object-Relational Data Model
7. Flat Data Model
8. Semi-Structured Data Model
Hierarchical Model
Hierarchical Model was the first DBMS model. This model organises the data in the hierarchical tree
structure.
The hierarchy starts from the root which has root data and then it expands in the form of a tree adding
child node to the parent node.
This model easily represents some of the real-world relationships like food recipes, sitemap of a website
etc.
Example: We can represent the relationship between the shoes present on a shopping website in the
following way:
Advantages of Hierarchical Model
• Any change in the parent node is automatically reflected in the child node, so the integrity of data
is maintained.
Disadvantages of Hierarchical Model
• Complex relationships are not supported.
• As it does not support more than one parent per child node, a complex relationship in which a
child node needs to have two parent nodes can't be represented using this model.
• If a parent node is deleted then the child node is automatically deleted.
Network Model
This model is an extension of the hierarchical model.
It was the most popular model before the relational model. This model is the same as the hierarchical
model, the only difference is that a record can have more than one parent.
It replaces the hierarchical tree with a graph.
Example: In the example below we can see that node student has two parents i.e. CSE Department and
Library. This was earlier not possible in the hierarchical model.
Entity-Relationship Model
• Attributes: An entity has real-world properties called attributes. These are the characteristics of
that entity. Example: The entity teacher has properties like teacher id, salary, age, etc.
• Relationship: A relationship tells how two entities are related. Example: A teacher works for a
department.
Object-Oriented Data Model
Example:
In the above example, we have two objects Employee and Department. All the data and
relationships of each object are contained as a single unit. The attributes like Name, Job_title of
the employee and the methods which will be performed by that object are stored as a single object.
The two objects are connected through a common attribute, i.e. the Department_id, and the
communication between these two will be done with the help of this common id.
Object-Relational Model
As the name suggests it is a combination of both the relational model and the object-oriented
model. This model was built to fill the gap between object-oriented model and the relational
model. We can have many advanced features like we can make complex data types according to
our requirements using the existing data types. The problem with this model is that this can get
complex and difficult to handle. So, proper understanding of this model is required.
Flat Data Model
It is a simple model in which the database is represented as a table consisting of rows and
columns. To access any data, the computer has to read the entire table. This makes the model slow
and inefficient.
Semi-Structured Model
Semi-structured model is an evolved form of the relational model. We cannot differentiate
between data and schema in this model.
Example: Web-Based data sources which we can't differentiate between the schema and data of
the website.
(6)What is Database Architecture?
A Database Architecture is a representation of DBMS design. It helps to design, develop, implement,
and maintain the database management system. DBMS architecture allows dividing the database system into
individual components that can be independently modified, changed, replaced, and altered. It also helps to
understand the components of a database.
1-Tier Architecture:
1-Tier Architecture in DBMS is the simplest architecture of Database, in which the client, server, and
Database all reside on the same machine. A simple one-tier architecture example would be anytime you
install a Database on your system and access it to practice SQL queries. But such architecture is rarely used
in production.
2-Tier Architecture:
A 2 Tier Architecture in DBMS is a Database architecture where the presentation layer runs on a
client (PC, Mobile, Tablet, etc.), and data is stored on a server called the second tier. Two tier architecture
provides added security to the DBMS as it is not exposed to the end-user directly. It also provides direct and
faster communication.
In the above 2 Tier client-server architecture of database management system, we can see that one
server is connected with clients 1, 2, and 3.
3-Tier Architecture:
A 3 Tier Architecture in DBMS is the most popular client server architecture in DBMS in which the
development and maintenance of functional processes, logic, data access, data storage, and user interface is
done independently as separate modules.
Three Tier architecture contains a presentation layer, an application layer, and a database server.
3-Tier database Architecture design is an extension of the 2-tier client-server architecture.
A 3-tier architecture has the following layers:
The Application layer resides between the user and the DBMS; it is responsible for
communicating the user’s request to the DBMS system and sending the response from the DBMS to the
user. The application layer (business logic layer) also processes functional logic, constraints, and rules
before passing data to the user or down to the DBMS.
The goal of Three Tier client-server architecture is:
• To separate the user applications and physical database
• To support DBMS characteristics
• Program-data independence
• Supporting multiple views of the data
Chapter 2: INTRODUCTION TO DATABASE DESIGN
(7) What is Database Design? Beyond ER design
• Database Design is a collection of processes that facilitate the designing, development,
implementation and maintenance of enterprise data management systems.
• A properly designed database is easy to maintain, improves data consistency and is cost-effective
in terms of disk storage space. The database designer decides how the data elements correlate and
what data must be stored.
• The main objectives of database design in DBMS are to produce logical and physical design
models of the proposed database system.
• The logical model concentrates on the data requirements and the data to be stored independent of
physical considerations. It does not concern itself with how the data will be stored or where it will
be stored physically.
• The physical data design model involves translating the logical DB design of the database onto
physical media using hardware resources and software systems such as database management
systems (DBMS).
A simple ER Diagram:
In the following diagram we have two entities Student and College and their relationship.
• The relationship between Student and College is many to one as a college can have many
students however a student cannot study in multiple colleges at the same time.
1. Entity
An entity is an object or component of data. An entity is represented as rectangle in an ER diagram.
For example: In the following ER diagram we have two entities Student and College and these two entities
have many to one relationship as many students study in a single college. We will read more about
relationships later, for now focus on entities.
Weak Entity:
An entity that cannot be uniquely identified by its own attributes and relies on the relationship with other
entity is called weak entity. The weak entity is represented by a double rectangle. For example – a bank
account cannot be uniquely identified without knowing the bank to which the account belongs, so bank
account is a weak entity.
2. Attribute
An attribute describes the property of an entity. An attribute is represented as Oval in an ER diagram.
There are four types of attributes
1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute
1. Key attributes:
A key attribute can uniquely identify an entity from an entity set. For example, student roll number can
uniquely identify a student from a set of students. A key attribute is represented by an oval, the same as
other attributes; however, the text of a key attribute is underlined.
2. Composite attribute:
An attribute that is a combination of other attributes is known as composite attribute. For example, In
student entity, the student address is a composite attribute as an address is composed of other attributes such
as pin code, state, country.
3. Multivalued attribute:
An attribute that can hold multiple values is known as multivalued attribute. It is represented
with double ovals in an ER Diagram. For example – a person can have more than one phone number, so
the phone number attribute is multivalued.
4. Derived attribute:
A derived attribute is one whose value is dynamic and derived from another attribute. It is represented
by dashed oval in an ER Diagram. For example – Person age is a derived attribute as it changes over time
and can be derived from another attribute (Date of birth).
1. One to one Relationship
When a single instance of an entity is associated with a single instance of another entity then it is
called a one to one relationship. For example, a person has only one passport and a passport is given to one
person.
2. One to many Relationship
When a single instance of an entity is associated with more than one instance of another entity then it
is called a one to many relationship. For example – a customer can place many orders but an order cannot
be placed by many customers.
3. Many to one Relationship
When more than one instance of an entity is associated with a single instance of another entity then it
is called a many to one relationship. For example – many students can study in a single college but a student
cannot study in many colleges at the same time.
4. Many to many Relationship
When more than one instance of an entity is associated with more than one instance of another entity
then it is called a many to many relationship. For example, a student can be assigned to many projects and a
project can be assigned to many students.
A Total participation of an entity set represents that each entity in entity set must have at least one
relationship in a relationship set. For example: In the below diagram each college must have at-least one
associated Student.
Types of relationship are as follows:
a. One-to-One Relationship:
When only one instance of an entity is associated with the relationship, then it is known as one to one
relationship.
For example, a female can marry one male, and a male can marry one female.
b. One-to-many relationship:
When only one instance of the entity on the left, and more than one instance of an entity on the right
associates with the relationship then this is known as a one-to-many relationship.
For example, a scientist can invent many inventions, but each invention is made by only one specific
scientist.
c. Many-to-one relationship:
When more than one instance of the entity on the left, and only one instance of an entity on the right
associates with the relationship then it is known as a many-to-one relationship.
For example, Student enrolls for only one course, but a course can have many students.
d. Many-to-many relationship:
When more than one instance of the entity on the left, and more than one instance of an entity on the
right associates with the relationship then it is known as a many-to-many relationship.
For example, an employee can be assigned to many projects, and a project can have many employees.
Super class shape has sub groups: Triangle, Square and Circle.
Sub classes are the group of entities with some unique attributes. Sub class inherits the properties and
attributes from super class.
Specialization and Generalization
Generalization is the process of combining entities that share common attributes or properties into a
single generalized entity (super class).
It is a bottom-up process. For example, consider three sub-entities Car, Truck and Motorcycle; these
three entities can be generalized into one super class named Vehicle.
Specialization is the process of identifying subsets of an entity that share some distinguishing
characteristic. It is a top-down approach in which one entity is broken down into lower-level entities.
In above example Vehicle entity can be a Car, Truck or Motorcycle.
Owner is the subset of two super classes: Vehicle and House.
Aggregation
Aggregation represents a relationship between a whole object and its components.
Consider a ternary relationship Works_On between Employee, Branch and Manager. Now the best
way to model this situation is to use aggregation, So, the relationship-set, Works_On is a higher level entity-
set. Such an entity-set is treated in the same manner as any other entity-set. We can create a binary
relationship, Manager, between Works_On and Manager to represent who manages what tasks
– For each manageably sized business domain or subarea, the conceptual schema design is
performed in seven steps:
1. Transform familiar examples into elementary facts.
2. Draw the fact types, and apply a population check.
3. Check for entity types to be combined, and note any arithmetic derivations.
4. Add uniqueness constraints, and check the arity of fact types.
5. Add mandatory role constraints, and check for logical derivations.
6. Add value, set-comparison, and subtyping constraints.
7. Add other constraints and perform final checks.
UNIT-2
INTRODUCTION TO DATABASE DESIGN
(1)Introduction to the Relational Model:
Relational Model was proposed by E.F.Codd to model data in the form of relations or tables. After
designing the conceptual model of Database using ER diagram, we need to convert the conceptual model
in the relational model which can be implemented using any RDBMS languages like Oracle SQL, MySQL
etc. So we will see what Relational Model is.
Key Constraints
An attribute that can uniquely identify a tuple in a relation is called a key of the table. The value of
the attribute must be unique for different tuples in the relation.
Example: In the given table, CustomerID is the key attribute of the Customer table. Each customer has
exactly one key value; CustomerID = 1 belongs only to the customer named “Google”.
CustomerID CustomerName Status
---------- ------------ --------
1          Google       Active
2          Amazon       Active
3          Apple        Inactive
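A key constraint can be declared when the table is created. A minimal sketch, using the column names
from the example above:
CREATE TABLE Customer (
    CustomerID   INT PRIMARY KEY,  -- key attribute: values must be unique and not NULL
    CustomerName VARCHAR(50),
    Status       VARCHAR(10)
);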
Example: The tuple for CustomerID = 1 is referenced twice in the relation Billing, so we know that
the customer named Google has a billing amount of $300.
Whenever one of these operations is applied, the integrity constraints specified on the relational database
schema must never be violated.
Insert Operation
The insert operation gives values of the attribute for a new tuple which should be inserted into a
relation.
Update Operation
You can see that in the below-given relation table CustomerName= ‘Apple’ is updated from Inactive
to Active.
25
Delete Operation
To specify deletion, a condition on the attributes of the relation selects the tuple to be deleted.
The Delete operation could violate referential integrity if the tuple which is deleted is referenced by
foreign keys from other tuples in the same database.
Select Operation
The Select operation only reads tuples from a relation and does not modify the database, so it can
never violate integrity constraints.
Integrity enforcement:
For example, the integrity of data in the pubs2 and pubs3 databases requires that a book title in
the titles table must have a publisher in the publishers table.
You cannot insert books that do not have a valid publisher into titles, because it violates the data
integrity of pubs2 or pubs3.
Transact-SQL provides several mechanisms for integrity enforcement in a database such as rules,
defaults, indexes, and triggers. These mechanisms allow you to maintain these types of data integrity:
• Requirement – requires that a table column must contain a valid value in every row; it cannot allow
null values. The create table statement allows you to restrict null values for a column.
• Check or validity – limits or restricts the data values inserted into a table column. You can use
triggers or rules to enforce this type of integrity.
• Uniqueness – no two table rows can have the same non-null values for one or more table columns.
You can use indexes to enforce this integrity.
• Referential – data inserted into a table column must already have matching data in another table
column or another column in the same table. A single table can have up to 192 references.
As an alternative to using rules, defaults, indexes, and triggers, Transact-SQL provides a series of integrity
constraints as part of the create table statement to enforce data integrity as defined by the SQL standards.
These integrity constraints are described later in this chapter.
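A sketch of how the four types of integrity can be declared directly in a create table statement (the
column definitions here are hypothetical, loosely following the titles/publishers example above):
CREATE TABLE titles (
    title_id INT NOT NULL,                        -- requirement integrity: no NULLs
    title    VARCHAR(80) NOT NULL UNIQUE,         -- uniqueness integrity
    price    DECIMAL(8,2) CHECK (price >= 0),     -- check or validity integrity
    pub_id   INT REFERENCES publishers (pub_id),  -- referential integrity
    PRIMARY KEY (title_id)
);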
➢ Suppose we want to find all students younger than 18, or all students enrolled in course number 203.
➢ We can retrieve the rows corresponding to students who are younger than 18 with the following query:
➢ SELECT * FROM Students S WHERE S.age < 18;
➢ SELECT is a command that comes under the data query language subset of SQL commands.
➢ A SELECT statement retrieves zero or more rows from one or more database tables.
➢ SELECT is the most commonly used data manipulation language command.
➢ The symbol * means that we retain all fields of the selected tuples in the result.
➢ 'S' is a variable that takes on the value of each tuple in the Students table, one tuple after the other.
➢ The condition S.age < 18 in the WHERE clause specifies that we want to select only tuples in which
the age field has a value less than 18.
➢ This query evaluates to the relation containing the qualifying tuples.
➢ In addition to selecting a subset of tuples, a query can extract a subset of the fields of each selected tuple.
➢ We can compute the names and logins of students who are younger than 18 with the following query:
SELECT S.name, S.login FROM Students S WHERE S.age < 18;
➢ The output for the query:
name        login
----------  ----------------
Rameswaran  rameswaran@GATE
Krishnan    Krishnan@GATE
Views make applications independent of the underlying database tables to a certain extent. Without
views, an application must be written directly against the tables. With views, the program can be built on
top of the views instead, separating the program from the underlying database tables.
Disadvantages of views:
➢ Performance:
Views create the appearance of a table, but the DBMS must still translate queries against the view into
queries against the underlying source tables. If the view is defined by a complex, multi-table query then
simple queries on the views may take considerable time.
➢ Update restrictions:
When a user tries to update rows of a view, the DBMS must translate the request into an update on rows of
the underlying source tables. This is possible for simple views, but more complex views are often restricted
to read-only.
Types of Views:
➢ Simple View: A view based on only a single table, which doesn't contain GROUP BY clause and any
functions.
➢ Complex View: A view based on multiple tables, which contain GROUP BY clause and functions.
➢ Inline View: A view based on a subquery in FROM Clause, that subquery creates a temporary table
and simplifies the complex query.
➢ Materialized View: A view that stores the definition as well as data. It creates replicas of data by
storing it physically.
Student_Detail
STU_ID NAME    ADDRESS
1      Stephan Delhi
2      Kathrin Noida
3      David   Ghaziabad
4      Alina   Gurugram
Student_Marks
STU_ID NAME MARKS AGE
1 Stephan 97 19
2 Kathrin 86 21
3 David 74 18
4 Alina 90 20
5 John 96 18
1. Creating view
A view can be created using the CREATE VIEW statement. We can create a view from a single
table or multiple tables.
Syntax:
CREATE VIEW view_name AS SELECT column1, column2..... FROM table_name WHERE condition;
2. Creating View from a single table
In this example, we create a View named DetailsView from the table Student_Detail.
Query:
CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM Student_Detail
WHERE STU_ID < 4;
Just like a table, we can query the view to see its data.
SELECT * FROM DetailsView;
Output:
NAME ADDRESS
Stephan Delhi
Kathrin Noida
David Ghaziabad
3. Creating View from multiple tables:
A view from multiple tables can be created by simply including multiple tables in the SELECT statement.
In the given example, a view is created named MarksView from two tables Student_Detail and
Student_Marks.
Query:
CREATE VIEW MarksView AS SELECT Student_Detail.NAME, Student_Detail.ADDRESS,
Student_Marks.MARKS FROM Student_Detail, Student_Marks WHERE Student_Detail.NAME =
Student_Marks.NAME;
To display data of View MarksView:
SELECT * FROM MarksView;
NAME    ADDRESS   MARKS
Stephan Delhi     97
Kathrin Noida     86
David   Ghaziabad 74
Alina   Gurugram  90
4. Deleting View
A view can be deleted using the Drop View statement.
Syntax
DROP VIEW view_name;
Example:
DROP VIEW DetailsView;
Relational algebra is a procedural query language that works on the relational model. The purpose of a
query language is to retrieve data from a database or perform various operations such as insert, update and
delete on the data. When we say that relational algebra is a procedural query language, we mean that it tells
both what data is to be retrieved and how it is to be retrieved.
On the other hand relational calculus is a non-procedural query language, which means it tells what
data to be retrieved but doesn’t tell how to retrieve it. We will discuss relational calculus in a separate
tutorial.
Select Operator (σ) Example
Query:
σ Customer_City="Agra" (CUSTOMER)
Output:
Customer_Name Customer_City
------------- -------------
Steve         Agra
Raghu         Agra
Union Operator (∪)
Union operator is denoted by ∪ symbol and it is used to select all the rows (tuples) from two tables
(relations).
Let's discuss the union operator a bit more. Let's say we have two relations R1 and R2, both having the
same columns, and we want to select all the tuples (rows) from these relations; then we can apply the union
operator on these relations.
Note: The rows (tuples) that are present in both the tables will only appear once in the union set. In
short you can say that there are no duplicates present after the union operation.
Syntax of Union Operator (∪)
table_name1 ∪ table_name2
Union Operator (∪) Example
Query:
∏ Student_Name (COURSE) ∪ ∏ Student_Name (STUDENT)
Output:
Student_Name
------------
Aditya
Carl
Paul
Lucy
Rick
Steve
Note: As you can see there are no duplicate names present in the output even though we had few
common names in both the tables, also in the COURSE table we had the duplicate name itself.
Intersect Operator (∩)
The Intersect operator returns only those rows that are present in both relations.
∏ Student_Name (STUDENT) ∩ ∏ Student_Name (COURSE)
Output:
Student_Name
------------
Aditya
Steve
Paul
Lucy
Set Difference is denoted by the – symbol. Let's say we have two relations R1 and R2, and we want to
select all those tuples (rows) that are present in relation R1 but not present in relation R2; this can be done
using the set difference R1 – R2.
Syntax of Set Difference (-)
table_name1 - table_name2
Set Difference (-) Example
Let's take the same tables COURSE and STUDENT that we have seen above.
Query:
Let's write a query to select those student names that are present in the STUDENT table but not present in
the COURSE table.
∏ Student_Name (STUDENT) - ∏ Student_Name (COURSE)
Output:
Student_Name
------------
Carl
Rick
Cartesian Product is denoted by the X symbol. Let's say we have two relations R1 and R2; then the
cartesian product of these two relations (R1 X R2) combines each tuple of the first relation R1 with
each tuple of the second relation R2. It sounds confusing, but once we take an example, you will
be able to understand it.
Syntax of Cartesian product (X)
R1 X R2
Cartesian product (X) Example
Table 1: R
Col_A Col_B
----- ------
AA 100
BB 200
CC 300
Table 2: S
Col_X Col_Y
----- -----
XX 99
YY 11
ZZ 101
Query:
Let's find the cartesian product of tables R and S.
R X S
Output:
Col_A Col_B Col_X Col_Y
----- ----- ----- -----
AA    100   XX    99
AA    100   YY    11
AA    100   ZZ    101
BB    200   XX    99
BB    200   YY    11
BB    200   ZZ    101
CC    300   XX    99
CC    300   YY    11
CC    300   ZZ    101
Rename Operator (ρ)
The rename operator is used to rename the output relation or its attributes.
Table: CUSTOMER
Customer_Id Customer_Name Customer_City
----------- ------------- -------------
C10100 Steve Agra
C10111 Raghu Agra
C10115 Chaitanya Noida
C10117 Ajeet Delhi
C10118 Carl Delhi
Query:
ρ(CUST_NAMES, ∏(Customer_Name)(CUSTOMER))
Output:
CUST_NAMES
----------
Steve
Raghu
Chaitanya
Ajeet
Carl
Notice that the two tables share a common column (dimension), User ID. In the User Table, the ID
column is the user ID and it’s the primary key for that table, whereas in the Event Table, the User_ID
column is the foreign key, since that column refers to the ID column in the Users table. We can use this
relationship to join the two tables together to get the user and events information in one table.
Inner Join
What if you want to have a table that contains only users that have done an action?
You would use an Inner Join to join the tables together. An inner join combines the columns on a
common dimension (the first N columns) when possible, and only includes data for the columns that share
the same values in the common N column(s). In the example, the User ID would be the common dimension
used for the inner join.
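In SQL, an inner join of the hypothetical Users and Events tables described above might look like this:
SELECT u.ID, u.Name, e.Action
FROM Users u INNER JOIN Events e ON u.ID = e.User_ID;
-- only users with at least one matching event appear in the result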
Left Join
Now, what if you want a table that contains all the users’ data and only the actions that those users
have performed, excluding actions performed by users not in the Users table?
You would use a Left Join to join the tables together. A left join combines the columns on a common
dimension (the first N columns) when possible, returning all rows from the first table with the
matching rows in the consecutive tables. The result is NULL in the consecutive tables when there is no
match. In this case, we would make the User Table the first (left table) to use for the left join.
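The corresponding SQL sketch, again assuming the Users and Events tables above:
SELECT u.ID, u.Name, e.Action
FROM Users u LEFT JOIN Events e ON u.ID = e.User_ID;
-- every user appears; Action is NULL for users with no matching event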
Cross Join
A good use case for this would be if you’re looking to combine two tables by appending them rather
than joining them. A Cross Join would result in a table with all possible combinations of your tables’ rows
together. This can result in enormous tables and should be used with caution.
Cross Joins will likely only be used when your tables contain single values that you want to join
together without a common dimension.
Relational calculus is a non-procedural query language that tells the system what data is to be retrieved
but doesn’t tell how to retrieve it; the user is not concerned with the details of how to obtain the end results.
➢ The relational calculus tells what to do but never explains how to do it.
Notation:
{T | P (T)} or {T | Condition (T)}
Where T is the resulting tuples
P(T) is the condition used to fetch T.
For example:
{ T.name | Author(T) AND T.article = 'database' }
OUTPUT: This query selects the tuples from the AUTHOR relation. It returns a tuple with
'name' from Author who has written an article on 'database'.
TRC (tuple relational calculus) queries can be quantified. In TRC, we can use the Existential (∃) and
Universal (∀) quantifiers.
For example:
{ R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}
Output: This query will yield the same result as the previous one.
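For comparison, the same request written in SQL (assuming an Author relation with name and article
attributes) would be:
SELECT name FROM Author WHERE article = 'database';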
Query to display the last name of those students where age is greater than 30
Last_Name
---------
Singh
Query to display all the details of students where Last name is ‘Singh’
Output:
First_Name Age
---------- ----
Ajeet 30
Chaitanya 31
Carl 28
➢ A query language is said to be relationally complete if it can express all the queries that can be
expressed in relational algebra.
➢ SQL is relationally complete.
➢ SQL provides additional expressive power beyond relational algebra.
➢ Expressive power (theorem due to Codd): every query that can be expressed in relational algebra can
be expressed as a safe query in relational calculus; the converse is also true.
Relational completeness:
➢ A query language (e.g. SQL) can express every query that is expressible in relational
algebra/calculus (actually, SQL is strictly more powerful).
A remark: unsafe queries
There are syntactically correct calculus queries that have an infinite number of answers: unsafe queries.
Ex: {S | ¬(S ∈ Sailors)}
➢ A query which yields an infinite number of answers is said to be unsafe and, of course, should not be
allowed by the system. It is possible to define a safe formula in both TRC and DRC.
Safety:
Certain queries stated in the relational calculus may lead to answers which contain an infinite number of
tuples, or at least as many as the system can handle.
UNIT -3
INTRODUCTION
• Date & Time – The DATE data type has: YEAR, MONTH, and DAY in the form YYYY-
MM-DD. Similarly, the TIME data type has the components HOUR, MINUTE, and
SECOND in the form HH:MM: SS. These formats can change based on the requirement.
• Timestamp & Interval – The TIMESTAMP data type includes a minimum of six positions,
for decimal fractions of seconds and an optional WITH TIME ZONE qualifier in addition to
the DATE and TIME fields. The INTERVAL data type mentions a relative value that can be
used to increment or decrement an absolute value of a date, time, or timestamp.
Access mysql:
In order to access your MySQL database, please follow these steps:
1. Log into your Linux web server via Secure Shell.
2. Open the MySQL client program on the server in the /usr/bin directory.
3. Type in the following syntax to access your database:
$ mysql -h {hostname} -u username -p {databasename}
Password: {your password}
hostname: the name of the MySQL server that you are assigned to, for example,
mysql4.safesecureweb.com
databasename: the name of your MySQL database
password: the password you use to access your MySQL database
SQL                     MySQL
It’s a query language.  It’s database software. It uses SQL as the language to query the database.
EXAMPLE
ALTER TABLE STU_DETAILS ADD(ADDRESS VARCHAR2(20));
ALTER TABLE STU_DETAILS MODIFY (NAME VARCHAR2(20));
d. TRUNCATE: It is used to delete all the rows from the table and free the space containing the table.
Syntax:
TRUNCATE TABLE table_name;
Example:
TRUNCATE TABLE EMPLOYEE;
o INSERT
o UPDATE
o DELETE
INSERT: The INSERT statement is a SQL query. It is used to insert data into the row of a table.
Syntax:
INSERT INTO TABLE_NAME (col1, col2, ..., colN) VALUES (value1, value2, ..., valueN);
Or
INSERT INTO TABLE_NAME VALUES (value1, value2, value3, ..., valueN);
For example:
INSERT INTO javatpoint (Author, Subject) VALUES ("Sonoo", "DBMS");
UPDATE: This command is used to update or modify the value of a column in the table.
Syntax:
UPDATE table_name SET [column_name1= value1,...column_nameN = valueN]
[WHERE CONDITION]
For example:
UPDATE students SET User_Name = 'Sonoo' WHERE Student_Id = '3'
DELETE: It is used to remove one or more rows from a table.
Syntax:
DELETE FROM table_name [WHERE condition];
For example:
DELETE FROM javatpoint WHERE Author="Sonoo";
LIMIT clause
The LIMIT clause is used in the SELECT statement to constrain the number of rows to return.
The LIMIT clause accepts one or two arguments. The values of both arguments must be zero or
positive integers.
The following illustrates the LIMIT clause syntax with two arguments:
SELECT
select_list
FROM
table_name
LIMIT [offset,] row_count;
In this syntax:
• The offset specifies the offset of the first row to return. The offset of the first row is 0, not 1.
• The row_count specifies the maximum number of rows to return.
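For example, assuming a students table, the following skips the first 2 rows and returns the next 3:
SELECT * FROM students
ORDER BY Student_Id
LIMIT 2, 3;  -- offset 2, row_count 3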
Sort Results:
Description
The MySQL ORDER BY clause is used to sort the records in your result set.
Syntax
The syntax for the ORDER BY clause in MySQL is:
SELECT expressions
FROM tables
[WHERE conditions]
ORDER BY expression [ ASC | DESC ];
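For example, assuming the students table has an Age column, the following lists students from oldest to
youngest:
SELECT User_Name, Age FROM students ORDER BY Age DESC;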
49
SQL OR Operator
The OR operator in SQL shows a record from the table if any of the conditions separated by the OR
operator evaluates to True. It is also known as the disjunctive operator and is used with the WHERE clause.
Syntax of OR operator:
SELECT column1, ...., columnN FROM table_Name WHERE condition1 OR condition2 OR
condition3 OR ....... OR conditionN;
SQL BETWEEN Operator
The BETWEEN operator in SQL shows the records within the range mentioned in the SQL query.
This operator operates on numbers, characters, and date/time operands.
If no value falls in the given range, the operator returns no rows.
Syntax of BETWEEN operator:
SELECT column_Name1, column_Name2 ...., column_NameN FROM table_Name WHERE
column_name BETWEEN value1 AND value2;
SQL IN Operator
The IN operator in SQL allows database users to specify two or more values in a WHERE clause.
This logical operator minimizes the requirement of multiple OR conditions.
This operator makes the query easier to learn and understand. This operator returns those rows whose
values match with any value of the given list.
Syntax of IN operator:
SELECT column_Name1, column_Name2 ...., column_NameN FROM table_Name WHERE
column_name IN (list_of_values);
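A small sketch combining these operators on the hypothetical students table:
SELECT * FROM students
WHERE Age BETWEEN 18 AND 25 OR Student_Id IN (1, 3, 7);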
LCASE()
MySQL LCASE() converts the characters of a string to lower case characters.
Syntax : LCASE(str)
Example : SELECT LCASE('MYTESTSTRING');
Output : myteststring
LENGTH()
MySQL LENGTH() returns the length of a given string.
Syntax : LENGTH(str)
Example : SELECT LENGTH('text');
Output : 4
LOWER()
MySQL LOWER() converts all the characters in a string to lowercase characters.
Syntax : LOWER(str)
Example : SELECT LOWER('MYTESTSTRING');
Output : myteststring
UPPER()
MySQL UPPER() converts all the characters in a string to uppercase characters.
Syntax : UPPER(str)
Example : SELECT UPPER('myteststring');
Output : MYTESTSTRING
SUBSTR()
MySQL SUBSTR() returns the specified number of characters from a particular position of a given
string. SUBSTR() is a synonym for SUBSTRING().
Syntax : SUBSTR(str,pos,len)
Example : SELECT SUBSTR('w3resource',4,3);
Output : eso
CHAR()
CHAR() interprets each argument N as an integer and returns a string consisting of the characters
given by the code values of those integers. NULL values are skipped.
Syntax : CHAR(N,... [USING charset_name])
Example : SELECT CHAR(77,121,83,81,'76');
Output : MySQL
CONCAT()
Returns the string that results from concatenating one or more arguments. If all arguments are
nonbinary strings, the result is a nonbinary string. If the arguments include any binary strings, the
result is a binary string. A numeric argument is converted to its equivalent nonbinary string form.
Syntax : CONCAT(str1,str2,...)
Example : SELECT CONCAT('w3resource','.','com');
Output : w3resource.com
LTRIM(str)
MySQL LTRIM() removes the leading space characters of a string passed as argument.
Syntax : LTRIM(str)
Example : SELECT LTRIM('   Hello');
Output : Hello (leading spaces have been removed)
LEFT()
MySQL LEFT() returns a specified number of characters from the left of a given string. Both the
number and the string are supplied in the arguments as str and len of the function.
Syntax : LEFT(str,len)
Example : SELECT LEFT('w3resource', 3);
Output : w3r
RIGHT()
MySQL RIGHT() extracts a specified number of characters from the right side of a given string.
Syntax : RIGHT(str,len)
Example : SELECT RIGHT('w3resource',8);
Output : resource
MID()
MySQL MID() extracts a substring from a string. The actual string, position to start extraction and
length of the extracted string - all are specified as arguments.
Syntax : MID(str,pos,len)
Example : SELECT MID('w3resource',4,3);
Output : eso
REPLACE()
MySQL REPLACE() replaces all the occurrences of a substring within a string.
Syntax : REPLACE(str,from_str,to_str)
Example : SELECT REPLACE('w3resource','ur','r');
Output : w3resorce
REPEAT()
MySQL REPEAT() repeats a string for a specified number of times.
The function returns NULL if either of the arguments is NULL.
Syntax : REPEAT(str,count)
Example : SELECT REPEAT('**-',5);
Output : **-**-**-**-**-
LOCATE()
MySQL LOCATE() returns the position of the first occurrence of a string within a string. Both of
these strings are passed as arguments. An optional argument may be used to specify from which position of
the string (i.e. string to be searched) searching will start. If this position is not mentioned, searching starts
from the beginning.
Syntax : LOCATE(substr,str,pos);
Example : SELECT LOCATE('st','myteststring');
Output : 5
REVERSE()
Returns a given string with the order of the characters reversed.
Syntax : REVERSE(str)
Example : SELECT REVERSE('w3resource');
Output : ecruoser3w
FIELD()
Returns the index (position) of str in the str1, str2, str3, ... list. Returns 0 if str is not found.
Syntax : FIELD(str,str1,str2,str3,...)
Example : SELECT FIELD('ank', 'b', 'ank', 'of', 'monk');
Output : 2
STRCMP()
MySQL STRCMP() function is used to compare two strings. It returns 0 if both of the strings are the same,
-1 when the first argument is smaller than the second according to the defined order, and 1 when the
second one is smaller than the first one.
Syntax:
STRCMP (expr1, expr2)
Code:
SELECT STRCMP('mytesttext', 'mytesttext');
Sample Output:
mysql> SELECT STRCMP('mytesttext', 'mytesttext');
+------------------------------------+
| STRCMP('mytesttext', 'mytesttext') |
+------------------------------------+
|                                  0 |
+------------------------------------+
1 row in set (0.01 sec)
SQL Aggregate Functions
o SQL aggregation function is used to perform the calculations on multiple rows of a single column of
a table. It returns a single value.
o It is also used to summarize the data.
Types of SQL Aggregation Function
1. COUNT FUNCTION
o The COUNT function is used to count the number of rows in a database table. It can work on both
numeric and non-numeric data types.
o The COUNT function uses COUNT(*), which returns the count of all the rows in a specified table.
COUNT(*) counts duplicates and NULLs.
Syntax
COUNT(*) or
COUNT( [ALL|DISTINCT] expression )
Sample table:
PRODUCT_MAST
PRODUCT COMPANY QTY RATE COST
Item1 Com1 2 10 20
Item2 Com2 3 25 75
Item3 Com1 2 30 60
Item4 Com3 5 10 50
Item5 Com2 2 20 40
Example: COUNT()
SELECT COUNT(*) FROM PRODUCT_MAST;
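Output: 5 (the PRODUCT_MAST table above has five rows).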
2. SUM Function
The SUM function is used to calculate the sum of all selected values. It works on numeric fields only.
Syntax
SUM()
or
SUM( [ALL|DISTINCT] expression )
Example: SUM()
SELECT SUM(COST) FROM PRODUCT_MAST;
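Output: 245 (20 + 75 + 60 + 50 + 40).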
3. AVG function
The AVG function is used to calculate the average value of the numeric type. AVG function returns
the average of all non-Null values.
Syntax
AVG()
or
AVG( [ALL|DISTINCT] expression )
Example:
SELECT AVG(COST) FROM PRODUCT_MAST;
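Output: 49 (245 / 5).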
4. MAX Function
MAX function is used to find the maximum value of a certain column. This function determines the
largest value of all selected values of a column.
Syntax
MAX()
or
MAX( [ALL|DISTINCT] expression )
Example:
SELECT MAX(RATE) FROM PRODUCT_MAST;
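Output: 30.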
5. MIN Function
MIN function is used to find the minimum value of a certain column. This function determines the
smallest value of all selected values of a column.
Syntax
MIN()
or
MIN( [ALL|DISTINCT] expression )
Example:
SELECT MIN(RATE) FROM PRODUCT_MAST;
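Output: 10.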
MySQL Date Functions
The following are some of the most important built-in date functions in MySQL:
Function        Description
NOW()           Returns the current date and time
CURDATE()       Returns the current date
CURTIME()       Returns the current time
DATE()          Extracts the date part of a date or datetime expression
DATEDIFF()      Returns the number of days between two dates
DATE_ADD()      Adds a time interval to a date
DATE_FORMAT()   Displays date/time values in different formats
• Delete Anomaly: When a project is deleted, it will result in deleting all the employees who
work on that project. Alternately, if an employee is the sole employee on a project, deleting
that employee would result in deleting the corresponding project.
• Attributes that are NULL frequently could be placed in separate relations (with the primary
key)
Definition:
• Functional dependencies (FDs) are used to specify formal measures of the "goodness" of
relational designs
• FDs and keys are used to define normal forms for relations
• FDs are constraints that are derived from the meaning and interrelationships of the data
attributes
• A set of attributes X functionally determines a set of attributes Y if the value of X
determines a unique value for Y
– X -> Y holds if whenever two tuples have the same value for X, they must have the
same value for Y
– For any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then
t1[Y]=t2[Y]
– X -> Y in R specifies a constraint on all relation instances r(R)
– Written as X -> Y; it can be displayed graphically on a relation schema by an arrow from X
to Y.
– FDs are derived from the real-world constraints on the attributes
• Examples of FD
– social security number determines employee name
SSN -> ENAME
– project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
– employee ssn and project number determines the hours per week that the employee
works on the project
{SSN, PNUMBER} -> HOURS
– An FD is a property of the attributes in the schema R
– The constraint must hold on every relation instance r(R)
– If K is a key of R, then K functionally determines all attributes in R (since we never
have two distinct tuples with t1[K]=t2[K])
Inference Rules for FDs
• Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold
Armstrong's inference rules:
IR1. (Reflexive) If Y subset-of X, then X -> Y
IR2. (Augmentation) If X -> Y, then XZ -> YZ
(Notation: XZ stands for X U Z)
IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z
• IR1, IR2, IR3 form a sound and complete set of inference rules
Some additional inference rules that are useful:
(Decomposition) If X -> YZ, then X -> Y and X -> Z
(Union) If X -> Y and X -> Z, then X -> YZ
(Pseudotransitivity) If X -> Y and WY -> Z, then WX -> Z
• The last three inference rules, as well as any other inference rules, can be deduced from IR1,
IR2, and IR3 (completeness property)
• Closure of a set F of FDs is the set F+ of all FDs that can be inferred from F
• Closure of a set of attributes X with respect to F is the set X+ of all attributes that are
functionally determined by X
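For example, given R(A, B, C, D) and F = {A -> B, B -> C}, the closure {A}+ = {A, B, C}: start with {A},
add B using A -> B, then add C using B -> C. Since D is not in {A}+, A by itself is not a key of R.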
• A Nonprime attribute is not a prime attribute—that is, it is not a member of any candidate
key.
Dependencies in DBMS is a relation between two or more attributes. It has the following types
in DBMS −
• Functional Dependency
• Fully-Functional Dependency
• Transitive Dependency
• Multivalued Dependency
• Partial Dependency
Let us start with Functional Dependency −
Functional Dependency
If one piece of information stored in a table uniquely determines another piece of information in the
same table, then it is called a Functional Dependency. Consider it an association between two attributes
of the same relation.
If P functionally determines Q, then
P -> Q
For example, in an Employee table, EmpName is functionally dependent on EmpID because EmpName
can take only one value for a given value of EmpID:
EmpID -> EmpName
Fully-Functional Dependency
An attribute is fully functionally dependent on a set of attributes X if it is functionally dependent on X
and not on any proper subset of X. For example, ProjectCost is fully functionally dependent on ProjectID:
ProjectID ProjectCost
001       1000
002       5000
Transitive Dependency
When an indirect relationship causes functional dependency it is called Transitive
Dependency.
If P -> Q and Q -> R is true, then P-> R is a transitive dependency.
Multivalued Dependency
When existence of one or more rows in a table implies one or more other rows in the same
table, then the Multi-valued dependencies occur.
If a table has attributes P, Q and R, then Q and R are multi-valued facts of P.
It is represented by a double arrow (->->):
P ->-> Q
P ->-> R
In the above case, Multivalued Dependency exists only if Q and R are independent attributes.
Partial Dependency
Partial Dependency occurs when a nonprime attribute is functionally dependent on part of a
candidate key.
The 2nd Normal Form (2NF) eliminates the Partial Dependency. Let us see an example −
Example:
Consider a table with two columns Employee_Id and Employee_Name.
Normal Form  Description
1NF          A relation will be in 1NF if it contains only atomic (single) values.
2NF          A relation will be in 2NF if it is in 1NF and all non-key attributes are fully
             functionally dependent on the primary key.
3NF          A relation will be in 3NF if it is in 2NF and contains no transitive dependency
             for non-prime attributes.
BCNF         A stricter version of 3NF: for every functional dependency X → Y, X is a
             super key of the table.
4NF          A relation will be in 4NF if it is in Boyce Codd normal form and has no
             multi-valued dependency.
5NF          A relation is in 5NF if it is in 4NF, does not contain any join dependency,
             and joining should be lossless.
First Normal Form (1NF)
o A relation will be in 1NF if it contains only atomic values.
o It states that an attribute of a table cannot hold multiple values; each attribute must hold
only a single value.
o First normal form disallows the multi-valued attribute, composite attribute, and their combinations.
Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute EMP_PHONE.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PHONE              EMP_STATE
14     John     7272826385, 9064738238 UP

The decomposition of the EMPLOYEE table into 1NF:
EMP_ID EMP_NAME EMP_PHONE  EMP_STATE
14     John     7272826385 UP
14     John     9064738238 UP
Second Normal Form (2NF)
o A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on
the primary key.
Example: TEACHER table, in which a teacher can teach more than one subject:
TEACHER_ID SUBJECT   TEACHER_AGE
25         Chemistry 30
25         Biology   30
47         English   35
83         Math      38
83         Computer  38
In the given table, non-prime attribute TEACHER_AGE is dependent on TEACHER_ID which is a
proper subset of a candidate key. That's why it violates the rule for 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
Third Normal Form (3NF)
o A relation will be in 3NF if it is in 2NF and does not contain any transitive dependency.
o 3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
o If there is no transitive dependency for non-prime attributes, then the relation must be in third normal
form.
A relation is in third normal form if it holds at least one of the following conditions for every non-
trivial functional dependency X → Y.
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP dependent on
EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) transitively dependent on super
key(EMP_ID). It violates the rule of third normal form.
That's why we need to move the EMP_CITY and EMP_STATE to the new
<EMPLOYEE_ZIP> table, with EMP_ZIP as a Primary key.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_ZIP

EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY
201010  UP        Noida
02228   US        Boston
60007   US        Chicago
06389   UK        Norwich
462007  MP        Bhopal
Boyce Codd normal form (BCNF)
o BCNF is an advanced version of 3NF; it is stricter than 3NF.
o A table is in BCNF if, for every functional dependency X → Y, X is a super key of the table.
o For BCNF, the table should be in 3NF, and for every FD, the LHS must be a super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO
In the above table, the functional dependencies are:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP_ID, EMP_DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
264 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, the schema is in BCNF because the left-hand side of both functional dependencies is a key.
Fourth normal form (4NF)
o A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency.
o For a dependency A → B, if for a single value of A multiple values of B exist, then the relation has a multi-valued dependency.
Example
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities; hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, a student with STU_ID, 21 contains two courses, Computer and Math and
two hobbies, Dancing and Singing. So there is a Multi-valued dependency on STU_ID, which leads to
unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
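The 4NF decomposition is just two projections of STUDENT with duplicates removed, which a short Python sketch can mimic (the tuple encoding is an assumption for illustration):

    # Sketch: decomposing STUDENT (STU_ID, COURSE, HOBBY) into 4NF projections.
    student = [
        (21, "Computer", "Dancing"), (21, "Math", "Singing"),
        (34, "Chemistry", "Dancing"), (74, "Biology", "Cricket"),
        (59, "Physics", "Hockey"),
    ]

    # Project on (STU_ID, COURSE) and (STU_ID, HOBBY); sets drop duplicates.
    student_course = sorted({(sid, c) for sid, c, _ in student})
    student_hobby = sorted({(sid, h) for sid, _, h in student})

    print(student_course)   # [(21, 'Computer'), (21, 'Math'), (34, 'Chemistry'), ...]
    print(student_hobby)    # [(21, 'Dancing'), (21, 'Singing'), (34, 'Dancing'), ...]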
Fifth normal form (5NF)
o A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining is lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid redundancy.
o 5NF is also known as Project-Join Normal Form (PJ/NF).
Example
The original table, with columns SUBJECT, LECTURER and SEMESTER, is decomposed into three tables:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
UNIT-4
Introduction to transaction processing
Transaction processing
• A transaction is a set of logically related operations; it contains a group of tasks.
• A transaction is an action, or series of actions, performed by a single user to access the contents of the database.
Example: Suppose an employee of bank transfers Rs.800 from X's account to Y's account.
This small transaction contains several low-level tasks:
X's Account :
1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Y's Account
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
Operations of Transaction:
Following are the main operations of a transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in a
buffer in main memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
Let's take the example of a debit transaction from an account, which consists of the following
operations:
1. R(X); // Read
2. X = X - 500; //Update
3. W(X); // Write
Commit: It is used to save the work done permanently.
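As a concrete sketch of Read/Write/Commit/Rollback, the transfer example can be written with Python's sqlite3 module (the table name and balances are illustrative, not from the notes); the whole transfer either commits or rolls back as a unit:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    con.execute("INSERT INTO account VALUES ('X', 5000), ('Y', 1000)")
    con.commit()

    try:
        # Read(X), update, Write(X); then Read(Y), update, Write(Y)
        con.execute("UPDATE account SET balance = balance - 800 WHERE name = 'X'")
        con.execute("UPDATE account SET balance = balance + 800 WHERE name = 'Y'")
        con.commit()        # Commit: save the work done permanently
    except sqlite3.Error:
        con.rollback()      # Rollback: undo every change of the failed transaction

    print(con.execute("SELECT * FROM account").fetchall())
    # [('X', 4200), ('Y', 1800)]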
TRANSACTION NOTATION
• ROLLBACK (or ABORT). This signals that the transaction has ended unsuccessfully, so
that any changes or effects that the transaction may have applied to the database must
be undone.
• Figure 21.4 shows a state transition diagram that illustrates how a transaction moves through
its execution states.
• A transaction goes into an active state immediately after it starts execution, where it can
execute its READ and WRITE operations.
• When the transaction ends, it moves to the partially committed state.
• At this point, some recovery protocols need to ensure that a system failure will not result in an
inability to record the changes of the transaction permanently (usually by recording changes in
the system log, discussed in the next section).
• Once this check is successful, the transaction is said to have reached its commit point and
enters the committed state.
• When a transaction is committed, it has concluded its execution successfully and all its
changes must be recorded permanently in the database, even if a system failure occurs.
• However, a transaction can go to the failed state if one of the checks fails or if the transaction
is aborted during its active state.
• The transaction may then have to be rolled back to undo the effect of its WRITE operations on
the database.
• The terminated state corresponds to the transaction leaving the system.
• The transaction information that is maintained in system tables while the transaction has been
running is removed when the transaction terminates.
• Failed or aborted transactions may be restarted later—either automatically or after being
resubmitted by the user—as brand new transactions.
(2).The System Log
• To be able to recover from failures that affect transactions, the system maintains
a log to keep track of all transaction operations that affect the values of database items,
as well as other transaction information that may be needed to permit recovery from
failures.
• The log is a sequential, append-only file that is kept on disk, so it is not affected by any
type of failure except for disk or catastrophic failure.
• Typically, one (or more) main memory buffers hold the last part of the log file, so that
log entries are first added to the main memory buffer.
• When the log buffer is filled, or when certain other conditions occur, the log buffer
is appended to the end of the log file on disk. In addition, the log file from disk is
periodically backed up to archival storage (tape) to guard against catastrophic failures.
• The following are the types of entries—called log records—that are written to the log file. In these entries, T refers to a unique transaction-id that is generated automatically by the system for each transaction and that is used to identify each transaction:
o [start_transaction, T]: indicates that transaction T has started execution.
o [write_item, T, X, old_value, new_value]: indicates that transaction T has changed the value of database item X from old_value to new_value.
o [read_item, T, X]: indicates that transaction T has read the value of database item X.
o [commit, T]: indicates that transaction T has completed successfully, and affirms that its effect can be committed (recorded permanently) to the database.
o [abort, T]: indicates that transaction T has been aborted.
• Protocols for recovery that avoid cascading rollbacks (see Section 21.4.2)—which include nearly all practical protocols—do not require that READ operations be written to the system log. However, if the log is also used for other purposes—such as auditing (keeping track of all database operations)—then such entries can be included. Additionally, some recovery protocols use simpler WRITE entries that include only one of new_value and old_value instead of both.
(3). Commit Point of a Transaction
• A transaction T reaches its commit point when all its operations that access the database
have been executed successfully and the effect of all the transaction operations on the
database have been recorded in the log.
• Beyond the commit point, the transaction is said to be committed, and its effect must
be permanently recorded in the database.
• The transaction then writes a commit record [commit, T] into the log.
• If a system failure occurs, we can search back in the log for all transactions T that have written
a [start_transaction, T] record into the log but have not written their [commit, T] record yet;
these transactions may have to be rolled back to undo their effect on the database during the
recovery process.
• Transactions that have written their commit record in the log must also have recorded all
their WRITE operations in the log, so their effect on the database can be redone from the log
records.
• Notice that the log file must be kept on disk. Updating a disk file involves copying the appropriate block of the file from disk to a buffer in main memory, updating the buffer in main memory, and copying the buffer back to disk.
• It is common to keep one or more blocks of the log file in main memory buffers, called
the log buffer, until they are filled with log entries and then to write them back to disk only
once, rather than writing to disk every time a log entry is added.
• This saves the overhead of multiple disk writes of the same log file buffer.
• At the time of a system crash, only the log entries that have been written back to disk are
considered in the recovery process because the contents of main memory may be lost.
• Hence, before a transaction reaches its commit point, any portion of the log that has not been
written to the disk yet must now be written to the disk.
• This process is called force-writing the log buffer before committing a transaction.
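A minimal Python sketch of force-writing a log buffer at commit time (the file name and record format are assumptions for illustration; a real log manager is far more involved):

    import os

    class SimpleLog:
        """Append-only log with an in-memory buffer, force-written at commit."""
        def __init__(self, path="wal.log"):
            self.f = open(path, "a")
            self.buffer = []

        def append(self, record):
            self.buffer.append(record)      # log entries go to the buffer first

        def force(self):
            for rec in self.buffer:
                self.f.write(rec + "\n")
            self.f.flush()
            os.fsync(self.f.fileno())       # make the OS push the block to disk
            self.buffer.clear()

    log = SimpleLog()
    log.append("[start_transaction, T1]")
    log.append("[write_item, T1, X, 5000, 4200]")
    log.append("[commit, T1]")
    log.force()   # the log reaches disk before T1 is reported committed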
Desirable Properties of Transactions
• Transactions should possess several properties, often called the ACID properties; they should
be enforced by the concurrency control and recovery methods of the DBMS. The following
are the ACID properties:
• Atomicity. A transaction is an atomic unit of processing; it should either be performed in its
entirety or not performed at all.
o The atomicity property requires that we execute a transaction to completion.
o It is the responsibility of the transaction recovery subsystem of a DBMS to ensure
atomicity.
o If a transaction fails to complete for some reason, such as a system crash in the midst of transaction execution, the recovery technique must undo any effects of the transaction on the database.
o On the other hand, write operations of a committed transaction must be eventually
written to disk.
• Consistency preservation. A transaction should be consistency preserving, meaning that if it
is completely executed from beginning to end without interference from other transactions, it
should take the database from one consistent state to another.
o The preservation of consistency is generally considered to be the responsibility of the
programmers who write the database programs or of the DBMS module that enforces
integrity constraints.
o Recall that a database state is a collection of all the stored data items (values) in the
database at a given point in time.
o A consistent state of the database satisfies the constraints specified in the schema as
well as any other constraints on the database that should hold.
o A database program should be written in a way that guarantees that, if the database is
in a consistent state before executing the transaction, it will be in a consistent state
after the complete execution of the transaction, assuming that no interference with
other transactions occurs.
• Isolation. A transaction should appear as though it is being executed in isolation from other
transactions, even though many transactions are executing concurrently. That is, the execution
of a transaction should not be interfered with by any other transactions executing
concurrently.
o The isolation property is enforced by the concurrency control subsystem of the DBMS.
o If every transaction does not make its updates (write operations) visible to other
transactions until it is committed, one form of isolation is enforced that solves the
temporary update problem and eliminates cascading rollbacks, but does not eliminate all
other problems.
o There have been attempts to define the level of isolation of a transaction.
o A transaction is said to have level 0 (zero) isolation if it does not overwrite the dirty reads
of higher-level transactions.
o Level 1 (one) isolation has no lost updates, and level 2 isolation has no lost updates and
no dirty reads.
o Finally, level 3 isolation (also called true isolation) has, in addition to level 2 properties,
repeatable reads.
• Durability or permanency. The changes applied to the database by a committed transaction must persist in the database.
o These changes must not be lost because of any failure.
o And last, the durability property is the responsibility of the recovery subsystem of the
DBMS.
We have discussed-
• A schedule is the order in which the operations of multiple transactions appear for execution.
• Serial schedules are always consistent.
Serializability in DBMS-
Serializable Schedules-
If a given non-serial schedule of ‘n’ transactions is equivalent to some serial schedule of ‘n’
transactions, then it is called as a serializable schedule.
Characteristics-
Serializable schedules behave exactly same as serial schedules.
Thus, serializable schedules are always-
• Consistent
• Recoverable
• Cascadeless
• Strict
Serial Schedules: no concurrency is allowed; all the transactions necessarily execute serially, one after the other.
Serializable Schedules: concurrency is allowed; multiple transactions can execute concurrently.
Serializability is mainly of two types-
1. Conflict Serializability
2. View Serializability
Conflict Serializability-
If a given non-serial schedule can be converted into a serial schedule by swapping its non-
conflicting operations, then it is called as a conflict serializable schedule.
Conflicting Operations-
Two operations are called as conflicting operations if all the following conditions hold true
for them-
• Both the operations belong to different transactions
• Both the operations are on the same data item
• At least one of the two operations is a write operation
Example-
Consider the following schedule-
In this schedule,
• W1(A) and R2(A) are called conflicting operations, because all the above conditions hold true for them.
Checking Whether a Schedule is Conflict Serializable-
Step-01: List all the conflicting operations in the schedule.
Step-02: Start creating a precedence graph by drawing one node for each transaction.
Step-03:
• Draw an edge for each conflict pair such that if Xi (V) and Yj (V) forms a conflict pair then draw
an edge from Ti to Tj.
• This ensures that Ti gets executed before Tj.
Step-04:
Check if there is any cycle formed in the graph.
• If there is no cycle found, then the schedule is conflict serializable otherwise not.
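Steps 02 through 04 amount to building the precedence graph and testing it for a cycle, as in the following Python sketch (encoding a schedule as (transaction, operation, item) triples is an assumed convention, not from the notes):

    # Sketch: conflict-serializability test via a precedence graph.
    schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # conflicting: different transactions, same item, at least one write
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))         # Ti must execute before Tj

    def has_cycle(edges):
        graph = {}
        for u, v in edges:
            graph.setdefault(u, set()).add(v)
        state = {}                          # 0 = visiting, 1 = done
        def visit(u):
            state[u] = 0
            for v in graph.get(u, ()):
                if state.get(v) == 0 or (v not in state and visit(v)):
                    return True             # back edge found: the graph has a cycle
            state[u] = 1
            return False
        return any(visit(u) for u in graph if u not in state)

    print("not conflict serializable" if has_cycle(edges) else "conflict serializable")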
Non-Serializable Schedules-
• A non-serial schedule which is not serializable is called as a non-serializable schedule.
• A non-serializable schedule is not guaranteed to produce the same effect as produced by some serial schedule on any consistent database.
Characteristics-
Non-serializable schedules-
• may or may not be consistent
• may or may not be recoverable
Characterizing Schedules Based on Recoverability-
Irrecoverable Schedules-
If in a schedule,
• A transaction performs a dirty read operation from an uncommitted transaction
• And commits before the transaction from which it has read the value
then such a schedule is known as an Irrecoverable Schedule.
Example- Consider the following schedule-
Here,
• T2 performs a dirty read operation.
• T2 commits before T1.
• T1 fails later and rolls back.
• The value that T2 read now stands to be incorrect.
• T2 cannot recover, since it has already committed.
Recoverable Schedules-
If in a schedule,
• A transaction performs a dirty read operation from an uncommitted transaction
• And its commit operation is delayed till the uncommitted transaction either commits or rolls back
then such a schedule is known as a Recoverable Schedule.
Here,
• The commit operation of the transaction that performs the dirty read is delayed.
• This ensures that it still has a chance to recover if the uncommitted transaction fails later.
Example- Consider the following schedule-
Here,
• T2 performs a dirty read operation.
• The commit operation of T2 is delayed till T1 commits or rolls back.
• T1 commits later; T2 is then allowed to commit.
• Had T1 failed, T2 would still have had a chance to recover by rolling back.
Checking Whether a Schedule is Recoverable or Irrecoverable-
Method-01:
Check whether the given schedule is conflict serializable or not.
• If the given schedule is conflict serializable, then it is surely recoverable. Stop and report your
answer.
• If the given schedule is not conflict serializable, then it may or may not be recoverable. Go and
check using other methods.
Thumb Rules
• All conflict serializable schedules are recoverable.
• All recoverable schedules may or may not be conflict serializable.
Method-02:
Check if there exists any dirty read operation.
(Reading from an uncommitted transaction is called as a dirty read)
• If there does not exist any dirty read operation, then the schedule is surely recoverable. Stop and
report your answer.
• If there exists any dirty read operation, then the schedule may or may not be recoverable.
If there exists a dirty read operation, then follow the following cases-
Case-01:
If the commit operation of the transaction performing the dirty read occurs before the commit
or abort operation of the transaction which updated the value, then the schedule is irrecoverable.
Case-02:
If the commit operation of the transaction performing the dirty read is delayed till the commit
or abort operation of the transaction which updated the value, then the schedule is recoverable.
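Cases 01 and 02 can be checked mechanically by tracking dirty reads and commit order, as in this Python sketch (same assumed schedule encoding as before, with explicit commit events; aborts are omitted for brevity):

    # Sketch: irrecoverability test. Events are (txn, action, item),
    # where action is W (write), R (read) or C (commit).
    schedule = [("T1", "W", "A"), ("T2", "R", "A"),
                ("T2", "C", None), ("T1", "C", None)]

    uncommitted_writes = {}   # item -> txn with an uncommitted write on it
    reads_from = {}           # reader txn -> set of writer txns it dirty-read
    committed = set()
    verdict = "recoverable"

    for txn, action, item in schedule:
        if action == "W":
            uncommitted_writes[item] = txn
        elif action == "R" and uncommitted_writes.get(item) not in (None, txn):
            reads_from.setdefault(txn, set()).add(uncommitted_writes[item])
        elif action == "C":
            if any(w not in committed for w in reads_from.get(txn, ())):
                verdict = "irrecoverable"   # committed before the txn it read from
            committed.add(txn)

    print(verdict)                          # 'irrecoverable' for this schedule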
Concurrency Control deals with the interleaved execution of more than one transaction; the protocols that make such interleavings safe are covered later in this unit.
What is Transaction?
A transaction is a set of logically related operations. Consider a debit transaction on account A, whose balance is Rs. 5000, consisting of three operations:
1. R(A);
2. A = A - 1000;
3. W(A);
• The first operation reads the value of A from the database and stores it in a buffer.
• The second operation decreases the value by 1000, so the buffer will contain 4000.
• The third operation writes the value from the buffer back to the database, so A's final value will be 4000.
But it may also be possible that a transaction fails after executing some of its operations. The failure can be because of hardware, software or power failure, etc. For example, if the debit transaction discussed above fails after executing operation 2, the value of A will remain 5000 in the database, which is not acceptable by the bank. To avoid this, the database has two important operations:
Commit: After all instructions of a transaction are successfully executed, the changes made
by transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the changes
made by transaction are undone.
Properties of a transaction
Atomicity: As a transaction is a set of logically related operations, either all of them should be executed or none. The debit transaction discussed above should either execute all three operations or none. If the debit transaction fails after executing operations 1 and 2, its new value 4000 will not be updated in the database, which leads to inconsistency.
Consistency: If operations of debit and credit transactions on same account are executed
concurrently, it may leave database in an inconsistent state.
For Example, T1 (debit of Rs. 1000 from A) and T2 (credit of 500 to A) executing
concurrently, the database reaches inconsistent state.
Table 1: Concurrent execution of debit (T1) and credit (T2) on account A

T1              T1 buffer   T2              T2 buffer   A (database)
                                                        A=5000
R(A);           A=5000                                  A=5000
                A=5000      R(A);           A=5000      A=5000
A=A-1000;       A=4000                      A=5000      A=5000
                A=4000      A=A+500;        A=5500      A=5000
W(A);           A=4000                      A=5500      A=4000
                            W(A);           A=5500      A=5500
• Let us assume Account balance of A is Rs. 5000. T1 reads A(5000) and stores the value in its
local buffer space. Then T2 reads A(5000) and also stores the value in its local buffer space.
• T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1 buffer space. Then T2
performs A=A+500 (5000+500=5500) and 5500 is stored in T2 buffer space. T1 writes the value
from its buffer back to database.
• A’s value is updated to 4000 in database and then T2 writes the value from its buffer back to
database. A’s value is updated to 5500 which shows that the effect of debit transaction is lost and
database has become inconsistent.
• To maintain consistency of the database, we need concurrency control protocols, which are discussed later in this unit. The operations of T1 and T2, with their buffers and the database, are shown in Table 1.
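The lost update of Table 1 can be replayed step by step in a few lines of Python; no real concurrency is needed, because the interleaving itself causes the anomaly:

    # Sketch: the lost-update anomaly of Table 1, replayed step by step.
    A = 5000                  # database value

    t1_buf = A                # T1: R(A)
    t2_buf = A                # T2: R(A), before T1 writes back
    t1_buf = t1_buf - 1000    # T1: A = A - 1000  (T1 buffer: 4000)
    t2_buf = t2_buf + 500     # T2: A = A + 500   (T2 buffer: 5500)
    A = t1_buf                # T1: W(A) -> database holds 4000
    A = t2_buf                # T2: W(A) -> database holds 5500

    print(A)   # 5500: the debit of 1000 is lost (a serial order gives 4500)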
What is a Schedule?
A schedule is a series of operations from one or more transactions. A schedule can be of two
types:
• Serial Schedule: When one transaction completely executes before another transaction starts, the schedule is called a serial schedule. A serial schedule is always consistent. e.g.; if a schedule S has debit transaction T1 and credit transaction T2, the possible serial schedules are T1 followed by T2 (T1->T2) or T2 followed by T1 (T2->T1).
• Concurrent Schedule: When operations of a transaction are interleaved with operations of
other transactions of a schedule, the schedule is called Concurrent schedule. e.g.; Schedule of
debit and credit transaction shown in Table 1 is concurrent in nature. But concurrency can
lead to inconsistency in the database. The above example of a concurrent schedule is also
inconsistent.
Question: Consider the following transaction involving two bank accounts x and y:
1. read(x);
2. x := x – 50;
3. write(x);
4. read(y);
5. y := y + 50;
6. write(y);
The constraint that the sum of the accounts x and y should remain constant is that of?
1. Atomicity
2. Consistency
3. Isolation
4. Durability
Answer: 2. Consistency. The requirement that x + y stays constant is an integrity constraint that every transaction must preserve.
Lock Based Concurrency Control Protocol in DBMS
A Database may provide a mechanism that ensures that the schedules are either conflict or
view serializable and recoverable (and preferably cascadeless). Testing a schedule for serializability after it has executed is obviously too late!
So we need concurrency control protocols that ensure serializability.
Concurrency-control protocols:
• Allow concurrent schedules, but ensure that the schedules are conflict/view serializable, and are recoverable and maybe even cascadeless.
• These protocols do not examine the precedence graph as it is being created; instead, a protocol imposes a discipline that avoids non-serializable schedules.
• Different concurrency control protocols provide different trade-offs between the amount of concurrency they allow and the amount of overhead that they impose.
• We'll be learning some protocols which are important for GATE CS. Questions from this topic are frequently asked, and it is recommended to learn this concept.
Different categories of protocols:
• Lock Based Protocol
▪ Basic 2-PL
▪ Conservative 2-PL
▪ Strict 2-PL
▪ Rigorous 2-PL
• Graph Based Protocol
• Time-Stamp Ordering Protocol
• Multi-version Protocol
• A lock is a variable associated with a data item that describes a status of data item with
respect to possible operation that can be applied to it.
• They synchronize the access by concurrent transactions to the database items.
• It is required in this protocol that all the data items must be accessed in a mutually exclusive
manner.
• Let me introduce you to two common locks which are used and some terminology followed
in this protocol.
• A lock is a mechanism to control concurrent access to a data item
• Data items can be locked in two modes:
▪ Exclusive (X) mode: the data item can be both read and written. An X-lock is requested using the lock-X instruction.
▪ Shared (S) mode: the data item can only be read. An S-lock is requested using the lock-S instruction.
• Lock requests are made to the concurrency-control manager by the programmer. Transaction
can proceed only after request is granted.
Lock-compatibility matrix

            S        X
    S       true     false
    X       false    false

• A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
• Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
• If a lock cannot be granted, the requesting transaction is made to wait till all incompatible
locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
A locking protocol is a set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules
Two-Phase Locking (2-PL) Protocol
• This protocol ensures conflict-serializable schedules.
• Phase 1: Growing Phase
▪ Transaction may obtain locks
▪ Transaction may not release locks
• Phase 2: Shrinking Phase
▪ Transaction may release locks
▪ Transaction may not obtain locks
• Two-phase locking with lock conversions:
– First Phase:
  can acquire a lock-S on an item
  can acquire a lock-X on an item
  can convert a lock-S to a lock-X (upgrade)
– Second Phase:
  can release a lock-S
  can release a lock-X
  can convert a lock-X to a lock-S (downgrade)
• This protocol assures serializability, but it still relies on the programmer to insert the various locking instructions.
1. Shared Lock (S): also known as a Read-only lock. As the name suggests, it can be shared between transactions because, while holding this lock, a transaction does not have permission to update data on the data item. An S-lock is requested using the lock-S instruction.
2. Exclusive Lock (X): the data item can be both read and written. This lock is exclusive and cannot be held simultaneously on the same data item. An X-lock is requested using the lock-X instruction.
3. Upgrade / Downgrade locks: a transaction that holds a lock on an item A is allowed, under certain conditions, to change the lock state from one state to another.
Upgrade: S(A) can be upgraded to X(A) if Ti is the only transaction holding the S-lock on element A.
Downgrade: we may downgrade X(A) to S(A) when we no longer want to write on data item A. As we were holding an X-lock on A, we need not check any conditions.
Shortly we'll use Two-Phase Locking (2-PL), which uses the concept of locks to guarantee serializability. Applying simple locking alone, we may not always produce serializable results; it may lead to deadlock or inconsistency.
4. Deadlock – consider the partial schedule below. T1 holds an Exclusive lock over B, and T2 holds a Shared lock over A. In Statement 7, T2 requests a lock on B, while in Statement 8, T1 requests a lock on A. This, as you may notice, imposes a deadlock as neither can proceed with its execution.
5. Starvation – is also possible if the concurrency control manager is badly designed. For example: a transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item. This may be avoided if the concurrency control manager is properly designed.
Problem With Simple Locking…: Consider the Partial Schedule:
        T1              T2
1       lock-X(B)
2       read(B)
3       B := B - 50
4       write(B)
5                       lock-S(A)
6                       read(A)
7                       lock-S(B)
8       lock-X(A)
9       ……              ……
Deadlocks
A deadlock is a condition where two or more transactions are waiting indefinitely for one
another to give up locks. Deadlock is said to be one of the most feared complications in DBMS as
no task ever gets finished and is in waiting state forever.
For example:
In the student table, transaction T1 holds a lock on some rows and needs to update some rows
in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and
needs to update the rows in the Student table held by Transaction T1.
Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock and
similarly, transaction T2 is waiting for T1 to release its lock.
All activities come to a halt state and remain at a standstill. It will remain in a standstill until
the DBMS detects the deadlock and aborts one of the transactions.
• Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release
its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
• Such a situation is called a deadlock.
Deadlock Avoidance
o When a database gets stuck in a deadlock, it is better to avoid the deadlock altogether than to abort or restart the transactions involved, which wastes both time and resources.
o Deadlock avoidance mechanism is used to detect any deadlock situation in advance. A
method like "wait for graph" is used for detecting the deadlock situation but this method is
suitable only for the smaller database. For the larger database, deadlock prevention method
can be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect whether the transaction is involved in a deadlock or not. The lock manager maintains a wait-for graph to detect deadlock cycles in the database.
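Deadlock detection on a wait-for graph is a plain cycle test, as in this Python sketch (the transactions and wait edges are illustrative):

    # Sketch: deadlock detection with a wait-for graph.
    # An edge (Ti, Tj) means Ti is waiting for a lock held by Tj.
    waits_for = {"T1": {"T2"}, "T2": {"T1"}}

    def find_cycle(graph):
        state = {}
        def visit(u):
            state[u] = "visiting"
            for v in graph.get(u, ()):
                if state.get(v) == "visiting":
                    return True             # back edge: a cycle, i.e., a deadlock
                if v not in state and visit(v):
                    return True
            state[u] = "done"
            return False
        return any(visit(u) for u in graph if u not in state)

    if find_cycle(waits_for):
        print("deadlock detected: choose a victim and roll it back")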
• Two-phase locking does not ensure freedom from deadlocks.
• Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost.
o Rollback -- determine how far to roll back transaction
o Total rollback: Abort the transaction and then restart it.
• More effective to roll back transaction only as far as necessary to break deadlock.
• Starvation happens if same transaction is always chosen as victim. Include the number of
rollbacks in the cost factor to avoid starvation.
Timestamp-Ordering Protocol
• The timestamp-ordering protocol guarantees serializability since all the arcs in the precedence graph are of the form: transaction with smaller timestamp → transaction with larger timestamp.
o Further, if a transaction Tj is rolled back, any transaction that has read a data item written by Tj must also abort.
o This can lead to cascading rollback, that is, a chain of rollbacks.
• Solution 1:
o A transaction is structured such that its writes are all performed at the end of its
processing
o All writes of a transaction form an atomic action; no transaction may execute while a
transaction is being written
o A transaction that aborts is restarted with a new timestamp
• Solution 2: Limited form of locking: wait for data to be committed before reading it
• Solution 3: Use commit dependencies to ensure recoverability
Thomas’ Write Rule
• Modified version of the timestamp-ordering protocol in which obsolete write operations
may be ignored under certain circumstances.
• When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
o Rather than rolling back Ti, as the timestamp-ordering protocol would have done, this write operation can be ignored.
• Otherwise this protocol is the same as the timestamp ordering protocol.
• Thomas' Write Rule allows greater potential concurrency.
o Allows some view-serializable schedules that are not conflict-serializable.
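A Python sketch of the write path under Thomas' write rule (the per-item timestamp bookkeeping is an assumed representation; the timestamp values are arbitrary):

    # Sketch: Thomas' write rule on top of timestamp ordering.
    # The item keeps the read/write timestamps of its youngest accessors.
    item = {"value": 0, "r_ts": 5, "w_ts": 7}

    def write(ts, new_value, item):
        if ts < item["r_ts"]:
            return "rollback"        # a younger transaction already read the item
        if ts < item["w_ts"]:
            return "ignore write"    # obsolete write: skipped under Thomas' rule
        item["value"] = new_value    # otherwise write and advance the W-timestamp
        item["w_ts"] = ts
        return "written"

    print(write(6, 42, item))        # 'ignore write' (6 >= r_ts but 6 < w_ts)
    print(write(9, 42, item))        # 'written'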
Validation-Based Protocol
• Execution of transaction Ti is done in three phases:
o Read and execution phase: transaction Ti writes only to temporary local variables.
o Validation phase: transaction Ti performs a "validation test" to determine whether the local variables can be written to the database without violating serializability.
o Write phase: if Ti is validated, the updates are applied to the database; otherwise, Ti is rolled back.
• The three phases of concurrently executing transactions can be interleaved, but each
transaction must go through the three phases in that order.
o Assume for simplicity that the validation and write phase occur together, atomically
and serially
▪ I.e., only one transaction executes validation/write at a time.
• Also called as optimistic concurrency control since transaction executes fully in the hope
that all will go well during validation
• Each transaction Ti has 3 timestamps
o Start(Ti) : the time when Ti started its execution
o Validation(Ti): the time when Ti entered its validation phase
o Finish(Ti) : the time when Ti finished its write phase
Multiversion Timestamp Ordering
• Each data item Q has a sequence of versions <Q1, Q2, ..., Qm>. Each version Qk contains three data fields:
• Content -- the value of version Qk.
• W-timestamp(Qk) -- timestamp of the transaction that created (wrote) version Qk
• R-timestamp(Qk) -- largest timestamp of a transaction that successfully read version
Qk
• When a transaction Ti creates a new version Qk of Q, Qk's W-timestamp and R-timestamp are
initialized to TS(Ti).
• R-timestamp of Qk is updated whenever a transaction Tj reads Qk, and TS(Tj) > R-
timestamp(Qk).
• Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the
version of Q whose write timestamp is the largest write timestamp less than or equal to
TS(Ti).
• If transaction Ti issues a read(Q), then the value returned is the content of version Qk.
• If transaction Ti issues a write(Q)
• if TS(Ti) < R-timestamp(Qk), then transaction Ti is rolled back.
• if TS(Ti) = W-timestamp(Qk), the contents of Qk are overwritten
• else a new version of Q is created.
• Observe that
• Reads always succeed
• A write by Ti is rejected if some other transaction Tj that (in the serialization order
defined by the timestamp values) should read
Ti's write, has already read a version created by a transaction older than Ti.
• Protocol guarantees serializability
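A Python sketch of these multiversion read/write rules on a single item Q (the version-list representation is an assumed convention; handling of a transaction older than every version is omitted):

    # Sketch: multiversion timestamp ordering on one item Q.
    versions = [{"value": 10, "w_ts": 0, "r_ts": 4},
                {"value": 20, "w_ts": 5, "r_ts": 6}]

    def pick_version(ts):
        """Version Qk with the largest W-timestamp <= TS(Ti)."""
        return max((v for v in versions if v["w_ts"] <= ts),
                   key=lambda v: v["w_ts"])

    def read(ts):
        qk = pick_version(ts)
        qk["r_ts"] = max(qk["r_ts"], ts)    # reads always succeed
        return qk["value"]

    def write(ts, value):
        qk = pick_version(ts)
        if ts < qk["r_ts"]:
            return "rollback"               # a younger transaction already read Qk
        if ts == qk["w_ts"]:
            qk["value"] = value             # overwrite the transaction's own version
        else:
            versions.append({"value": value, "w_ts": ts, "r_ts": ts})
        return "ok"

    print(read(7))       # 20, from the version with w_ts = 5 (and r_ts becomes 7)
    print(write(5, 99))  # 'rollback': that version's r_ts is already 7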
Multiversion Two-Phase Locking
• Differentiates between read-only transactions and update transactions.
• Update transactions acquire read and write locks, and hold all locks up to the end of the transaction. That is, update transactions follow rigorous two-phase locking.
• Each successful write results in the creation of a new version of the data item written.
• Each version of a data item has a single timestamp whose value is obtained from a
counter ts-counter that is incremented during commit processing.
• Read-only transactions are assigned a timestamp by reading the current value of ts-counter
before they start execution;
They follow the multiversion timestamp-ordering protocol for performing reads.
• When an update transaction wants to read a data item:
• it obtains a shared lock on it, and reads the latest version.
• When it wants to write an item:
• it obtains an exclusive (X) lock on it; it then creates a new version of the item and sets this version's timestamp to ∞.
• When update transaction Ti completes, commit processing occurs:
• Ti sets timestamp on the versions it has created to ts-counter + 1
• Ti increments ts-counter by 1
• Read-only transactions that start after Ti increments ts-counter will see the values updated
by Ti.
• Read-only transactions that start before Ti increments the ts-counter will see the value before
the updates by Ti.
UNIT-5
Disk storage basic file structures and hashing
Disk Storage Devices
Databases are stored as files of records on disks; these physical database file structures belong to the physical level of the three-schema architecture.
The collection of data in a DB must be stored on some storage medium, and the DBMS software can retrieve, update, and process this data as needed.
◼ Magnetic disks are the preferred secondary storage device: high storage capacity and low cost.
◼ Data is stored as magnetized areas on magnetic disk surfaces.
◼ A disk pack contains several magnetic disks connected to a rotating spindle.
◼ Disks are divided into concentric circular tracks on each disk surface. Track capacities vary, typically from 4 to 50 Kbytes or more.
◼ A track is divided into smaller blocks or sectors because it usually contains a large amount of information.
◼ The division of a track into sectors is hard-coded on the disk surface and cannot be changed.
◼ One type of sector organization calls a portion of a track that subtends a fixed angle at the
center as a sector.
◼ A track is divided into blocks.
◼ The block size B is fixed for each system.
◼ Typical block sizes range from B=512 bytes to B=4096 bytes.
◼ Whole blocks are transferred between disk and main memory for processing.
◼ Storage media forms a hierarchy
◼ A read-write head moves to the track that contains the block to be transferred.
◼ Disk rotation moves the block under the read-write head for reading or writing.
◼ A physical disk block (hardware) address consists of:
◼ a cylinder number (imaginary collection of tracks of same radius from all recorded surfaces)
◼ the track number or surface number (within the cylinder)
◼ and block number (within track).
◼ Reading or writing a disk block is time consuming because of the seek time s and rotational delay
(latency) rd.
◼ Double buffering can be used to speed up the transfer of contiguous disk blocks.
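The cost of one block access is roughly seek time + rotational delay + block transfer time, and double buffering pays off on contiguous blocks. A Python sketch with assumed, illustrative timing figures (not from the notes):

    # Sketch: time to read disk blocks (all figures are assumed examples).
    s = 8.0     # average seek time, ms
    rd = 4.2    # average rotational delay, ms (half a rotation at 7200 rpm)
    btt = 0.1   # block transfer time, ms

    one_block = s + rd + btt
    print(f"one random block access ~= {one_block:.1f} ms")

    # For contiguous blocks, seek + latency is paid once, transfer per block.
    n = 100
    contiguous = s + rd + n * btt
    print(f"{n} contiguous blocks ~= {contiguous:.1f} ms "
          f"vs {n * one_block:.1f} ms read randomly")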
Blocking
◼ Blocking:
◼ Refers to storing a number of records in one block on the disk.
◼ Blocking factor (bfr) refers to the number of records per block.
◼ There may be empty space in a block if an integral number of records do not fit in one block.
◼ Spanned Records:
Refers to records that exceed the size of one or more blocks and hence span a number of blocks
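The blocking factor and the file size in blocks follow directly from the block size B and record size R. A Python sketch with assumed example sizes:

    import math

    B = 512     # block size in bytes (assumed)
    R = 100     # fixed record size in bytes (assumed)
    r = 30000   # number of records in the file (assumed)

    bfr = B // R              # blocking factor: records per block (unspanned)
    b = math.ceil(r / bfr)    # number of blocks needed for the file
    unused = B - bfr * R      # empty space left in each block

    print(bfr, b, unused)     # 5 records/block, 6000 blocks, 12 bytes unused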
Files of Records
◼ A file is a sequence of records, where each record is a collection of data values (or data items).
◼ A file descriptor (or file header) includes information that describes the file, such as the field names
and their data types, and the addresses of the file blocks on disk.
◼ Records are stored on disk blocks.
◼ The blocking factor bfr for a file is the (average) number of file records stored in a disk block.
◼ A file can have fixed-length records or variable-length records.
◼ File records can be unspanned or spanned
◼ Unspanned: no record can span two blocks
◼ Spanned: a record can be stored in more than one block.
◼ The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or
indexed.
(blocks on Hard Disk → Physical block;
blocks on memory → logical blocks).
◼ In a file of fixed-length records, all records have the same format.
◼ Usually, unspanned blocking is used with such files.
◼ Files of variable-length records require additional information to be stored in each record, such as
separator characters and field types.
◼ Usually spanned blocking is used with such files.
Operation on Files
◼ Typical file operations include:
◼ OPEN: Readies the file for access, and associates a pointer that will refer to a current
file record at each point in time.
◼ FIND: Searches for the first file record that satisfies a certain condition, and makes it
the current file record.
◼ FINDNEXT: Searches for the next file record (from the current record) that satisfies a
certain condition, and makes it the current file record.
◼ READ: Reads the current file record into a program variable.
◼ INSERT: Inserts a new record into the file & makes it the current file record.
◼ DELETE: Removes the current file record from the file, usually by marking the
record to indicate that it is no longer valid.
◼ MODIFY: Changes the values of some fields of the current file record.
◼ CLOSE: Terminates access to the file.
◼ REORGANIZE: Reorganizes the file records.
◼ For example, the records marked deleted are physically removed from the file
or a new organization of the file records is created.
◼ READ_ORDERED: Read the file blocks in order of a specific field of the file.
Unordered Files
◼ Also called a heap or a pile file.
◼ New records are inserted at the end of the file.
◼ A linear search through the file records is necessary to search for a record.
◼ This requires reading and searching half the file blocks on the average, and is hence
quite expensive.
◼ Record insertion is quite efficient (its advantage).
Reading the records in order of a particular field requires sorting the file records
Unordered Files
1. Insert (a new record): very efficient. The address of the last file block is kept in the file header, so the system copies the last block from the disk into a buffer, adds the new record, re-writes the block back to the disk, and updates the header if needed.
2. Delete a record: find the block on the disk, copy it into a buffer, delete the record in the buffer, re-write the block back to the disk, and update the file header if needed.
◼ The following table shows the average number of block accesses needed to reach a specific record for a given type of primary file organization:

Type of organization    Access/search method             Average blocks to reach a record
Heap (unordered)        Sequential scan (linear search)  b/2
Ordered                 Sequential scan                  b/2
Ordered                 Binary search on ordering key    log2 b
Hashed                  Hashing                          1 (approximately)
Hashing Techniques
◼ Def: hashing is another type of primary file organization, based on a hash function, which provides very fast access to records under certain search conditions. This organization is usually called a hash file.
◼ The search condition must be an equality condition on a single field, called the hash field. In most cases, the hash field is also a key field of the file, in which case it is called the hash key.
The idea behind hashing is to provide a function h, called a hash function or randomizing function, which is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
Hashing Techniques Types
1. Internal Hashing (inside)
◼ For internal files, hashing is implemented as a hash table through the use of an array of
records. Assume that the array index range is from [ 0 to M – 1, where M is the number of
blocks ].
◼ The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ...,
bucketM-1.
◼ Typically, a bucket corresponds to one (or a fixed number of) disk block
◼ One of the file fields is designated to be the hash key of the file.
◼ The record with hash key value K is stored in bucket i, where i=h(K), where h is the hashing
function.
◼ Hash function 1: h(K) = K mod M returns the remainder of an integer hash field value K after division by M; this value is then used for the record address.
◼ Function 2 (folding): involves applying an arithmetic function (e.g., addition) or a logical function (e.g., exclusive or) to different portions of the hash field value to compute the hash address.
◼ Search is very efficient on the hash key.
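Both hash functions mentioned above are easy to sketch in Python (M, the key, and the two-digit folding width are illustrative assumptions):

    # Sketch: two hash functions for a file with M buckets.
    M = 7

    def h_mod(key):
        return key % M                  # Function 1: remainder after division by M

    def h_fold(key):
        """Function 2 (folding): add fixed-width pieces of the key, then mod M."""
        digits = str(key)
        pieces = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
        return sum(pieces) % M

    print(h_mod(123456))    # 123456 mod 7 = 4
    print(h_fold(123456))   # (12 + 34 + 56) mod 7 = 102 mod 7 = 4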
Hashed Files
◼ Collisions occur when a new record hashes to a bucket that is already full.
◼ An overflow file is kept for storing such records.
◼ Overflow records that hash to each bucket can be linked together.
The process of finding another position is called collision resolution
◼ There are numerous methods for collision resolution, including the following:
◼ Open addressing: Proceeding from the occupied position specified by the hash
address, the program checks the subsequent positions in order until an unused (empty)
position is found.
◼ Chaining: For this method, various overflow locations are kept, usually by extending
the array with a number of overflow positions. In addition, a pointer field is added to
each record location. A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash address location to the
address of that overflow location.
Multiple hashing: The program applies a second hash function if the first results in a collision.
If another collision results, the program uses open addressing or applies a third hash function and
then uses open addressing if necessary
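A Python sketch of the chaining approach, with in-memory lists standing in for the overflow locations and pointers of a real hash file:

    # Sketch: collision resolution by chaining.
    M = 5
    buckets = [[] for _ in range(M)]    # each bucket chains its overflow records

    def insert(key, record):
        buckets[key % M].append((key, record))

    def search(key):
        for k, rec in buckets[key % M]:  # walk the chain of the hashed bucket
            if k == key:
                return rec
        return None

    insert(12, "rec-12")
    insert(17, "rec-17")    # 17 % 5 == 12 % 5 == 2: a collision, so it is chained
    print(search(17))       # 'rec-17'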
Extendible hashing
◼ Extendible hashing stores an access structure in addition to the file, and hence is somewhat similar to indexing.
◼ The main difference is that the access structure is based on the values that result after applying the hash function to the search field; in indexing, the access structure is based on the values of the search field itself.
◼ A directory (an array of 2^d bucket addresses) is maintained, where d is called the global depth of the directory.
◼ The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the array to determine a directory entry, and the address in that entry determines the bucket in which the corresponding records are stored.
◼ A local depth d', stored with each bucket, specifies the number of bits on which the bucket contents are based.
Linear hashing
◼ The idea behind it is to allow a hash file to expand and shrink its number of buckets dynamically without needing a directory.
Parallelizing Disk Access using RAID Technology
◼ Secondary storage technology must take steps to keep up in performance and reliability with
processor technology.
◼ A major advance in secondary storage technology is represented by the development of
RAID, which originally stood for Redundant Arrays of Inexpensive Disks.
The main goal of RAID is to even out the widely different rates of performance improvement
of disks against those in memory and microprocessors.
RAID Technology
◼ A natural solution is a large array of small independent disks acting as a single higher-
performance logical disk.
◼ A concept called data striping is used, which utilizes parallelism to improve disk
performance.
◼ Data striping distributes data transparently over multiple disks to make them appear as a
single large, fast disk.
◼ Different raid organizations were defined based on different combinations of the two factors
of granularity of data interleaving (striping) and pattern used to compute redundant
information.
◼ Raid level 0 (uses striping/distribution) has no redundant data and hence has the best
write performance at the risk of data loss
◼ Raid level 1 uses mirrored disks.
◼ Raid level 2 uses memory-style redundancy by using Hamming codes, which contain
parity bits for distinct overlapping subsets of components. Level 2 includes both error
detection and correction.
◼ Raid level 3 uses a single parity disk relying on the disk controller to figure out which
disk has failed.
◼ Raid Levels 4 and 5 use block-level data striping, with level 5 distributing data and
parity information across all disks.
◼ The demand for higher storage has risen considerably in recent times.
◼ Organizations have a need to move from a static fixed data center oriented operation to a
more flexible and dynamic infrastructure for information processing.
◼ Thus they are moving to a concept of Storage Area Networks (SANs).
◼ In a SAN, online storage peripherals are configured as nodes on a high-speed network
and can be attached and detached from servers in a very flexible manner.
◼ This allows storage systems to be placed at longer distances from the servers and provide
different performance and connectivity options.
INDEXING:-
o Indexing in a database is defined based on its indexing attributes; an index can be DENSE or SPARSE.
PRIMARY INDEX:-
A primary index is an ordered file of fixed-length records with two fields.
The first field is the same as the primary key of the data file, and the second field points to the corresponding data block.
The primary indexing in DBMS is also divided into two types
1. DENSE INDEX
2. SPARSE INDEX
DENSE INDEXING:-
• In a dense index, an index record is created for every search-key value in the database.
• This helps to search faster.
• But it needs more space to store the index records.
SPARSE INDEXING:-
• It is an index record that appears for only some of the values in the
file.
• Sparse index helps you to resolve the issues of dense indexing in
DBMS.
• However, a sparse index stores index records for only some of the search-key values.
• It needs less space and less maintenance overhead for insertions and deletions.
• But it is slower than a dense index for locating records.
• A sparse index requires the data file to be kept in sorted order; on unordered data it cannot be maintained.
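The dense/sparse trade-off can be sketched over an ordered file: the sparse index keeps one entry per block, is binary-searched, and then only one block is scanned. A Python sketch (the block contents are illustrative):

    import bisect

    # Sketch: sparse index over an ordered file, one index entry per block.
    blocks = [[(5, "a"), (8, "b"), (12, "c")],
              [(15, "d"), (20, "e"), (27, "f")],
              [(31, "g"), (40, "h"), (44, "i")]]
    sparse_index = [blk[0][0] for blk in blocks]        # [5, 15, 31]

    def lookup(key):
        i = bisect.bisect_right(sparse_index, key) - 1  # block whose first key <= key
        if i < 0:
            return None
        for k, val in blocks[i]:                        # scan inside that one block
            if k == key:
                return val
        return None

    print(lookup(20))   # 'e': one index probe plus a single block scan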
SECONDARY INDEX :-
The secondary index in DBMS can be generated from a field which has a unique value for each record; it should be a candidate key.
It is also known as a non-clustering index.
This is a two-level database indexing technique that reduces the mapping size of the first level.
As the size of the table grows, the size of the mapping also grows; if the mapping size grows, fetching the address itself becomes slower.
In that case the sparse index will not be efficient, so to overcome the problem, the secondary index is introduced.
CLUSTER INDEX:-
A clustered index can be defined as an ordered data file.
Sometimes the index is created on non-primary key columns, which may not be unique for each record.
In this case, to identify the records faster, we group two or more columns to get a unique value and create an index out of them. This method is called clustering.
This scheme can be a little confusing because one disk block is shared by records belonging to different clusters. Using a separate disk block for each cluster is the better technique.
MULTILEVEL INDEX:-
Multi-level indexing in a database is used when a primary index does not fit in memory.
• If a single-level index is used, then a large index cannot be kept in memory, which leads to multiple disk accesses.
• A multi-level index breaks the index down into several smaller indices, making the outermost level so small that it can be saved in a single disk block, which can easily be accommodated anywhere in main memory.
B+ TREE:-
A B+ tree is a balanced search tree that follows a multi-level index format.
• A B+ tree supports both random access and sequential access.
• B+ tree leaf nodes hold the actual data pointers.
• A B+ tree ensures that all leaf nodes are linked as a linked list and remain at the same height; thus the tree stays balanced.