CBD 04 Normalisation
Data
❑ We assume that:
▪ a set of functional dependencies is given for each relation, and
▪ each relation has a designated primary key.
This information, combined with the tests (conditions) for normal forms,
drives the normalization process for relational schema design.
Normal Forms Based on Primary Keys
❑ Evaluate each relation for goodness and decompose it further as
needed to achieve higher normal forms, using normalization theory.
Normalization of Relations
❑ The normalization process takes a relation schema through a series of tests to certify whether it
satisfies a certain normal form.
❑ The process, which proceeds in a top-down fashion by evaluating each relation against the criteria
for normal forms and decomposing relations as necessary, can thus be considered as relational
design by analysis.
Normalization of data
❑ The process of analyzing the given relation schemas based on their FDs
and primary keys to achieve the desirable properties of
▪ (1) minimizing redundancy and
▪ (2) minimizing the insertion, deletion, and update anomalies
❑ It acts as a “filtering” or “purification” process that gives the design
successively better quality.
▪ An unsatisfactory relation schema that does not meet the condition for a normal form (the
normal form test) is decomposed into smaller relation schemas, each of which contains a subset
of the attributes and meets the test that the original relation failed.
Normalization procedure
❑ Provides database designers with:
▪ A formal framework for analyzing relation schemas based on their keys and on the functional
dependencies among their attributes
▪ A series of normal form tests that can be carried out on individual relation schemas so that
the relational database can be normalized to any desired degree
Normalization procedure
❑ Normal forms, when considered in isolation from other factors, do not guarantee
a good database design.
❑ It is generally not sufficient to check separately that each relation schema in the
database is, say, in BCNF or 3NF. Rather, the process of normalization through
decomposition must also guarantee two properties:
▪ The nonadditive join or lossless join property, which guarantees that the spurious tuple generation
problem does not occur with respect to the relation schemas created after decomposition (mandatory).
▪ The dependency preservation property, which ensures that each functional dependency is represented
in some individual relation resulting after decomposition (can sometimes be sacrificed).
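The spurious tuple problem can be illustrated with a small sketch. The relation R(A, B, C), its two rows, and the helper names `project` and `natural_join` are all hypothetical, chosen only to show that joining on a non-key attribute can produce tuples that were never in R.

```python
def project(rows, attrs):
    """Project a list of dict rows onto attrs; duplicates collapse (set semantics)."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}

def natural_join(left, right):
    """Natural join of two sets of tuples made of (attribute, value) pairs."""
    joined = set()
    for l in left:
        for r in right:
            merged = dict(l)
            # Rows match when they agree on every attribute they share.
            if all(merged.get(a, v) == v for a, v in r):
                merged.update(r)
                joined.add(tuple(sorted(merged.items())))
    return joined

# Hypothetical relation R(A, B, C); A is a key, B is not.
R = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c2"},
]
original = project(R, ["A", "B", "C"])

# Decomposition joined on the non-key attribute B: spurious tuples appear.
lossy = natural_join(project(R, ["A", "B"]), project(R, ["B", "C"]))
spurious = lossy - original

# Decomposition joined on the key A: the join returns exactly R.
lossless = natural_join(project(R, ["A", "B"]), project(R, ["A", "C"]))

print(len(original), len(lossy), len(spurious))  # 2 4 2
print(lossless == original)                      # True
```

The first decomposition is the kind that the nonadditive join property rules out; the second satisfies it because the shared attribute determines the rest of one of the two relations.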
Practical Use of Normal Forms
❑ Most practical design projects in commercial and governmental environments
acquire existing database designs from previous designs, from designs in
legacy models, or from existing files. These projects need assurance that
the designs are of good quality and sustainable over long periods of time.
❑ Existing designs are evaluated by applying the tests for normal forms, and
normalization is carried out in practice so that the resulting designs are of high
quality and meet the desirable properties stated previously.
Practical Use of Normal Forms
❑ Although several higher normal forms have been defined, such as the 4NF and 5NF, the practical
utility of these normal forms becomes questionable. The reason is that the constraints on which
they are based are rare and hard for the database designers and users to understand or to detect.
Designers and users must either already know them or discover them as a part of the business.
Thus, database design as practiced in industry today pays particular attention to normalization
only up to 3NF, BCNF, or at most 4NF.
❑ Another point worth noting is that the database designers need not normalize to the highest
possible normal form. Relations may be left in a lower normalization status, such as 2NF, for
performance reasons.
▪ Doing so incurs the corresponding penalties of dealing with the anomalies.
▪ Denormalization is the process of storing the join of higher normal form relations as a base relation, which is in a
lower normal form.
Keys and Attributes Participating in Keys
❑ The difference between a key and a superkey is that a key has to be minimal; that is, if we have a key
K = {A1, A2, … , Ak} of R, then K - {Ai} is not a superkey of R for any Ai, 1 <= i <= k.
❑ {Ssn} is a key for EMPLOYEE, whereas {Ssn}, {Ssn, Ename}, {Ssn, Ename, Bdate}, and any set of attributes that includes
Ssn are all superkeys.
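The key/superkey distinction can be checked mechanically with attribute closures. The sketch below is illustrative only: the function names are hypothetical, EMPLOYEE is reduced to the three attributes named on the slide, and the single FD Ssn -> {Ename, Bdate} is an assumption.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, relation, fds):
    """A superkey functionally determines every attribute of the relation."""
    return closure(attrs, fds) >= relation

def is_key(attrs, relation, fds):
    """A key is a superkey from which no single attribute can be removed."""
    attrs = set(attrs)
    if not is_superkey(attrs, relation, fds):
        return False
    return all(not is_superkey(attrs - {a}, relation, fds) for a in attrs)

# Assumed schema and FD, for illustration only.
EMPLOYEE = {"Ssn", "Ename", "Bdate"}
FDS = [(frozenset({"Ssn"}), frozenset({"Ename", "Bdate"}))]

print(is_key({"Ssn"}, EMPLOYEE, FDS))                # True
print(is_superkey({"Ssn", "Ename"}, EMPLOYEE, FDS))  # True
print(is_key({"Ssn", "Ename"}, EMPLOYEE, FDS))       # False: Ename can be dropped
```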
Keys and Attributes Participating in Keys
❑ If a relation schema has more than one key, each is called a candidate key.
❑ One of the candidate keys is arbitrarily designated to be the primary key, and the others are called
secondary keys.
❑ In a practical relational database, each relation schema must have a primary key.
❑ If no candidate key is known for a relation, the entire set of attributes can be treated as a default superkey.
❑ {Ssn} is the only candidate key for EMPLOYEE, so it is also the primary key.
First Normal Form
❑ It states that the domain of an attribute must include only atomic (simple, indivisible) values and that the
value of any attribute in a tuple must be a single value from the domain of that attribute. Hence, 1NF
disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single
tuple.
First Normal Form
❑ There are three main techniques to achieve first normal form for a relation such as DEPARTMENT, whose
Dlocations attribute holds a set of values. The first is decomposition:
▪ Remove the attribute Dlocations that violates 1NF and place it in a separate relation DEPT_LOCATIONS along with
the primary key Dnumber of DEPARTMENT.
▪ The primary key of this newly formed relation is the combination {Dnumber, Dlocation}.
▪ A distinct tuple in DEPT_LOCATIONS exists for each location of a department.
▪ This decomposes the non-1NF relation into two 1NF relations.
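The decomposition technique can be sketched on sample data. The rows and the Dname attribute below are hypothetical; only Dnumber, Dlocations, Dlocation, and DEPT_LOCATIONS come from the slide.

```python
# Hypothetical non-1NF DEPARTMENT rows: Dlocations holds a list of values.
department = [
    {"Dnumber": 5, "Dname": "Research", "Dlocations": ["Bellaire", "Sugarland", "Houston"]},
    {"Dnumber": 4, "Dname": "Administration", "Dlocations": ["Stafford"]},
]

def to_1nf(rows):
    """Split rows into a 1NF DEPARTMENT and a 1NF DEPT_LOCATIONS relation."""
    # DEPARTMENT keeps every attribute except the multivalued one.
    dept = [{k: v for k, v in row.items() if k != "Dlocations"} for row in rows]
    # DEPT_LOCATIONS gets one (Dnumber, Dlocation) tuple per location.
    dept_locations = sorted(
        {(row["Dnumber"], loc) for row in rows for loc in row["Dlocations"]}
    )
    return dept, dept_locations

dept, dept_locations = to_1nf(department)
print(dept_locations)
# [(4, 'Stafford'), (5, 'Bellaire'), (5, 'Houston'), (5, 'Sugarland')]
```

Each (Dnumber, Dlocation) pair occurs once, which is what makes the combination usable as the primary key of DEPT_LOCATIONS.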
First Normal Form
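❑ Expand the key so that there is a separate tuple in the original DEPARTMENT relation for each location
of a department. The primary key then becomes the combination {Dnumber, Dlocation}, and the remaining
department attributes are repeated in every tuple, which introduces redundancy.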
First Normal Form
❑ If a maximum number of values is known for the attribute—for example, if it is known that at most three locations
can exist for a department—replace the Dlocations attribute by three atomic attributes: Dlocation1, Dlocation2,
and Dlocation3.
❑ This solution has the disadvantage of introducing NULL values if most departments have fewer than three
locations. It further introduces spurious semantics about the ordering among the location values; that ordering is
not originally intended.
Second Normal Form
❑ Second normal form (2NF) is based on the concept of full functional dependency.
❑ A functional dependency X -> Y is a full functional dependency if removal of any attribute A from X means
that the dependency does not hold anymore.
❑ The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the
primary key.
❑ If the primary key contains a single attribute, the test need not be applied at all.
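The definition of a full functional dependency translates directly into a closure check. This is a minimal sketch: the relation EMP_PROJ, its attributes, and its FDs are assumed for illustration and do not appear on the slides.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_full_fd(lhs, rhs, fds):
    """X -> Y is full if it holds and fails after removing any one attribute from X."""
    lhs, rhs = set(lhs), set(rhs)
    if not rhs <= closure(lhs, fds):
        return False  # X -> Y does not hold at all
    # If some smaller subset determined Y, so would X minus a single attribute,
    # so testing single-attribute removals is enough.
    return all(not rhs <= closure(lhs - {a}, fds) for a in lhs)

# Assumed example: hours are recorded per employee and project,
# while an employee's name depends on Ssn alone.
FDS = [
    (frozenset({"Ssn", "Pnumber"}), frozenset({"Hours"})),
    (frozenset({"Ssn"}), frozenset({"Ename"})),
]

print(is_full_fd({"Ssn", "Pnumber"}, {"Hours"}, FDS))  # True
print(is_full_fd({"Ssn", "Pnumber"}, {"Ename"}, FDS))  # False: partial, Ssn suffices
```

With {Ssn, Pnumber} as the primary key, the second dependency is the kind of partial dependency the 2NF test looks for.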
Third Normal Form
❑ Third normal form (3NF) is based on the concept of transitive dependency.
❑ A functional dependency X -> Y in a relation schema R is a transitive dependency if there exists a set of
attributes Z in R that is neither a candidate key nor a subset of any key of R and both X -> Z and Z -> Y hold.
❑ The dependency Ssn -> Dmgr_ssn is transitive through Dnumber in EMP_DEPT because both the
dependencies Ssn -> Dnumber and Dnumber -> Dmgr_ssn hold and Dnumber is neither a key itself nor a
subset of the key of EMP_DEPT.
❑ Intuitively, we can see that the dependency of Dmgr_ssn on Dnumber is undesirable in EMP_DEPT since
Dnumber is not a key of EMP_DEPT.
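The EMP_DEPT example can be verified against the definition above. The sketch checks one proposed intermediate set Z rather than searching for one; the function name is hypothetical, EMP_DEPT is reduced to the attributes named on the slide, and {Ssn} is taken as its only key.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def transitive_via(x, z, y, fds, keys):
    """True if X -> Y is transitive through Z under the slide's definition."""
    x, z, y = set(x), set(z), set(y)
    # Z must be neither a candidate key nor a subset of any key.
    if any(z <= key for key in keys):
        return False
    return z <= closure(x, fds) and y <= closure(z, fds)

# Simplified EMP_DEPT, using only the dependencies named on the slide.
FDS = [
    (frozenset({"Ssn"}), frozenset({"Dnumber"})),
    (frozenset({"Dnumber"}), frozenset({"Dmgr_ssn"})),
]
KEYS = [{"Ssn"}]

print(transitive_via({"Ssn"}, {"Dnumber"}, {"Dmgr_ssn"}, FDS, KEYS))  # True
print(transitive_via({"Ssn"}, {"Ssn"}, {"Dnumber"}, FDS, KEYS))       # False: Z is the key
```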
Summary of 3 NFs
Decomposition Process
❑ Figure (a), in the next slide, describes parcels of land for sale in various counties
of a state.
❑ Suppose that there are two candidate keys: Property_id# and {County_name,
Lot#};
▪ that is, lot numbers are unique only within each county,
▪ but Property_id# numbers are unique across counties for the entire state.
Decomposition Process
Decomposition Process
❑ Based on the two candidate keys Property_id# and {County_name, Lot#}, the
functional dependencies FD1 and FD2 in Figure 14.12(a) hold.
❑ Suppose that the following two additional functional dependencies hold in LOTS:
▪ FD3: County_name -> Tax_rate. The tax rate is fixed for a given county (it does not vary lot by lot within the same county).
▪ FD4: Area -> Price. The price of a lot is determined by its area regardless of which county it is in.
Decomposition Process
❑ The LOTS relation schema violates the general definition of 2NF because Tax_rate is partially
dependent on the candidate key {County_name, Lot#}, due to FD3.
❑ To normalize LOTS into 2NF, we decompose it into the two relations LOTS1 and LOTS2, shown in
Figure 14.12(b).
❑ We construct LOTS1 by removing the attribute Tax_rate that violates 2NF from LOTS and placing it
with County_name (the left-hand side of FD3 that causes the partial dependency) into another
relation LOTS2.
▪ Both LOTS1 and LOTS2 are in 2NF.
▪ Notice that FD4 does not violate 2NF and is carried over to LOTS1.
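The 2NF violation in LOTS can be found by brute force. The sketch below assumes FD1 through FD4 as they are described on the slides (FD3 as County_name -> Tax_rate, FD4 as Area -> Price); the helper names are hypothetical, and the key search is exponential, so it only suits small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure covers the relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(candidate, fds) >= relation:
                keys.append(candidate)
    return keys

def partial_dependencies(relation, fds):
    """(part_of_key, attribute) pairs where a nonprime attribute depends on part of a key."""
    keys = candidate_keys(relation, fds)
    prime = set().union(*keys)
    found = []
    for key in keys:
        for size in range(1, len(key)):
            for part in combinations(sorted(key), size):
                for attr in sorted(closure(part, fds) - set(part) - prime):
                    found.append((part, attr))
    return found

LOTS = {"Property_id#", "County_name", "Lot#", "Area", "Price", "Tax_rate"}
FDS = [
    (frozenset({"Property_id#"}), frozenset(LOTS)),        # FD1
    (frozenset({"County_name", "Lot#"}), frozenset(LOTS)), # FD2
    (frozenset({"County_name"}), frozenset({"Tax_rate"})), # FD3
    (frozenset({"Area"}), frozenset({"Price"})),           # FD4
]

print(partial_dependencies(LOTS, FDS))  # [(('County_name',), 'Tax_rate')]

# Decompose as described: move Tax_rate out together with County_name.
LOTS1 = LOTS - {"Tax_rate"}
LOTS2 = {"County_name", "Tax_rate"}
```

FD4 is not reported because Area is not part of any candidate key, which matches the observation that FD4 does not violate 2NF.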
Decomposition Process
❑ LOTS2 (Figure 14.12(b)) is in 3NF.
❑ However, FD4 in LOTS1 violates 3NF because Area is not a superkey and Price is not
a prime attribute in LOTS1.
❑ To normalize LOTS1 into 3NF, we decompose it into the relation schemas LOTS1A and LOTS1B shown in
Figure 14.12(c).
❑ We construct LOTS1A by removing the attribute Price that violates 3NF from LOTS1 and placing it with
Area into another relation LOTS1B.
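The 3NF test used above (left-hand side is a superkey, or the right-hand side attribute is prime) can be sketched as follows. The helper names are hypothetical, FD4 is assumed to be Area -> Price, and only the listed FDs are checked, which is adequate for this small example.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure covers the relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(candidate, fds) >= relation:
                keys.append(candidate)
    return keys

def third_nf_violations(relation, fds):
    """(lhs, attribute) pairs where lhs is not a superkey and attribute is not prime."""
    prime = set().union(*candidate_keys(relation, fds))
    violations = []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= relation:
            continue  # lhs is a superkey, so this FD is allowed
        for attr in sorted(rhs - lhs - prime):
            violations.append((sorted(lhs), attr))
    return violations

LOTS1 = {"Property_id#", "County_name", "Lot#", "Area", "Price"}
FDS = [
    (frozenset({"Property_id#"}), frozenset(LOTS1)),        # FD1
    (frozenset({"County_name", "Lot#"}), frozenset(LOTS1)), # FD2
    (frozenset({"Area"}), frozenset({"Price"})),            # FD4
]

print(third_nf_violations(LOTS1, FDS))  # [(['Area'], 'Price')]

# Decompose as described: Price moves out together with Area.
LOTS1A = LOTS1 - {"Price"}
LOTS1B = {"Area", "Price"}
```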
Boyce-Codd Normal Form
❑ Suppose that we have thousands of lots in the relation but the lots are from only two counties: DeKalb and
Fulton.
❑ Suppose also that lot sizes in DeKalb County are only 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0 acres, whereas lot sizes in
Fulton County are restricted to 1.1, 1.2, … , 1.9, and 2.0 acres. The area of a lot then determines its county:
▪ FD5: Area -> County_name.
▪ LOTS1A is still in 3NF, because County_name is a prime attribute.
▪ LOTS1A is not in BCNF, however, because Area is not a superkey of LOTS1A.
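The effect of FD5 on LOTS1A can be checked with a BCNF test, which requires the left-hand side of every nontrivial FD to be a superkey. This is a sketch under the FDs described on the slides; the function names are hypothetical.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(relation, fds):
    """Listed nontrivial FDs whose left-hand side is not a superkey of the relation."""
    return [
        (sorted(lhs), sorted(rhs - lhs))
        for lhs, rhs in fds
        if rhs - lhs and not closure(lhs, fds) >= relation
    ]

LOTS1A = {"Property_id#", "County_name", "Lot#", "Area"}
FDS = [
    (frozenset({"Property_id#"}), frozenset(LOTS1A)),        # FD1
    (frozenset({"County_name", "Lot#"}), frozenset(LOTS1A)), # FD2
    (frozenset({"Area"}), frozenset({"County_name"})),       # FD5
]

print(bcnf_violations(LOTS1A, FDS))  # [(['Area'], ['County_name'])]
```

3NF tolerates FD5 only because County_name belongs to the candidate key {County_name, Lot#}; BCNF has no such exemption.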
Boyce-Codd Normal Form
❑ In practice, most relation schemas that are in 3NF are also in BCNF.
▪ A relation schema R is in 3NF but not in BCNF only if there exists some FD X -> A that holds in R
❑ with X not being a superkey
❑ and A being a prime attribute
❑ Ideally, relational database design should strive to achieve BCNF or 3NF for every relation
schema.
❑ Achieving the normalization status of just 1NF or 2NF is not considered adequate, since
both were developed historically to be intermediate normal forms as stepping stones to
3NF and BCNF.
Boyce-Codd Normal Form
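❑ Let TEACH (Student, Course, Instructor) be a relation with the following dependencies:
▪ FD1: {Student, Course} -> Instructor
▪ FD2: Instructor -> Course
▪ FD2 means that each instructor teaches only one course, which is a constraint for this application.
▪ Note that {Student, Course} is a candidate key for this relation.
▪ For brevity, write Student as A, Course as B, and Instructor as C.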
Boyce-Codd Normal Form
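❑ TEACH can be decomposed into a pair of two-attribute relations in three ways:
▪ 1. R1 (Student, Instructor) and R2 (Student, Course)
▪ 2. R1 (Course, Instructor) and R2 (Course, Student)
▪ 3. R1 (Instructor, Course) and R2 (Instructor, Student)
❑ Which of these three is a desirable decomposition?
❑ None of these BCNF decompositions preserves all the functional dependencies, but we must still meet the
nonadditive join property. A simple test, the NJB test (nonadditive join test for binary decompositions),
checks the decomposition of a relation R into two relations R1 and R2:
▪ The decomposition is nonadditive with respect to a set of FDs F if and only if either
(R1 ∩ R2) -> (R1 - R2) or (R1 ∩ R2) -> (R2 - R1) is in the closure of F.
❑ If we apply this test to the three decompositions, we find that only the third one meets it.
In the third decomposition, R1 ∩ R2 is Instructor and R1 - R2 is Course. Because
Instructor -> Course, the NJB test is satisfied and the decomposition is nonadditive.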
❑ Hence, the proper decomposition of TEACH into BCNF relations is:
❑ TEACH1 (Instructor, Course) and TEACH2 (Instructor, Student)
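The binary nonadditive join test can be run on the candidate decompositions of TEACH. In this sketch the function name `njb_lossless` is hypothetical, the FDs are {Student, Course} -> Instructor and Instructor -> Course, and the first two decompositions are the other two ways of splitting three attributes into overlapping pairs.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def njb_lossless(r1, r2, fds):
    """Nonadditive iff (R1 ∩ R2) determines (R1 - R2) or (R2 - R1)."""
    determined = closure(r1 & r2, fds)
    return (r1 - r2) <= determined or (r2 - r1) <= determined

FDS = [
    (frozenset({"Student", "Course"}), frozenset({"Instructor"})),
    (frozenset({"Instructor"}), frozenset({"Course"})),
]

decompositions = [
    ({"Student", "Instructor"}, {"Student", "Course"}),
    ({"Course", "Instructor"}, {"Course", "Student"}),
    ({"Instructor", "Course"}, {"Instructor", "Student"}),
]

results = [njb_lossless(r1, r2, FDS) for r1, r2 in decompositions]
print(results)  # [False, False, True]
```

Only the pair that shares Instructor passes, since Instructor is the one shared attribute that determines the rest of one of the two relations.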
Boyce-Codd Normal Form
❑ We make sure that we meet this property, because nonadditive decomposition is a must during
normalization.
❑ In general, a relation R not in BCNF can be decomposed so as to meet the nonadditive join property by the
following procedure. It decomposes R successively into a set of relations that are in BCNF:
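One common form of such a procedure can be sketched as follows: while some relation has an FD X -> A with X not a superkey, split it into XA and the relation without A. This is an assumption about the procedure's shape, not a transcription of it; the function names are hypothetical, and the subset search is exponential, so it only suits small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def find_violation(relation, fds):
    """Return (X, A) where X -> A holds inside relation and X is not a superkey, else None."""
    for size in range(1, len(relation)):
        for combo in combinations(sorted(relation), size):
            x = set(combo)
            x_closure = closure(x, fds)
            determined = (x_closure & relation) - x
            if determined and not relation <= x_closure:
                return x, sorted(determined)[0]
    return None

def bcnf_decompose(relation, fds):
    """Recursively split relation until no BCNF violation remains."""
    violation = find_violation(relation, fds)
    if violation is None:
        return [set(relation)]
    x, a = violation
    # Split into XA and R - A; both are strictly smaller than R, so this terminates.
    return bcnf_decompose(x | {a}, fds) + bcnf_decompose(relation - {a}, fds)

TEACH = {"Student", "Course", "Instructor"}
FDS = [
    (frozenset({"Student", "Course"}), frozenset({"Instructor"})),
    (frozenset({"Instructor"}), frozenset({"Course"})),
]

for schema in bcnf_decompose(TEACH, FDS):
    print(sorted(schema))
# ['Course', 'Instructor']
# ['Instructor', 'Student']
```

Each split shares X between the two parts and X determines A, so every step passes the binary nonadditive join test; the result for TEACH matches TEACH1 and TEACH2.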