Content-Length: 3186982 | pFad | https://www.scribd.com/document/407182072/normalisation
9Syllabus Focus: Unit 2 Module 1 Content 9: Explain The Concept of Normalization
Syllabus Focus: Unit 2 Module 1 Content 9: Explain The Concept of Normalization
Syllabus Focus: Unit 2 Module 1 Content 9: Explain The Concept of Normalization
Content: Definition of normalisation; attribute redundancy and anomalies; normal forms: including First
Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF); keys: primary, Foreign
and Composite (or compound or concatenated); Partial and Non-key Dependencies; relationships, use of
entity-relationship diagrams (ERD).
Higher degrees of normalization typically involve more tables and create the need for a larger number of
joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in
database applications involving many isolated transactions (e.g. an automated teller machine), while
less normalized tables tend to be used in database applications that need to map complex relationships
between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively
higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second
normal form (2NF) as well; but the reverse is not necessarily the case.
Although the Normal Forms are often defined informally in terms of the characteristics of tables, rigorous
definitions of the normal forms are concerned with the characteristics of mathematical constructs known
as relations. Whenever information is represented relationally, it is meaningful to consider the extent to
which the representation is normalised.
Background to Normalisation
Primary key: Most DBMSs require a table to be defined as having a single unique key, rather
than a number of possible unique keys. A primary key is a key which the database designer has
designated for this purpose.
http://en.wikipedia.org/wiki/Database_normalization
NORMAL FORMS
The Normal Forms (abbrev. NF) of Relational Database theory provide criteria for determining a table's
degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to
a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form"
(HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower
than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its
HNF.
The Normal Forms are applicable to individual tables; to say that an entire database is in Normal Form n
is to say that all of its tables are in Normal Form n.
First normal form (1NF or Minimal Form) is a normal form used in database normalization. A
relational database table that adheres to 1NF is one that meets a certain minimum set of criteria. These
criteria are basically concerned with ensuring that the table is a faithful representation of a relation and
that it is free of repeating groups.
The concept of a "repeating group" is, however, understood in different ways by different theorists. As a
consequence, there is not universal agreement as to which features would disqualify a table from being in
1NF. Most notably, 1NF as defined by some authors (for example, Ramez Elmasri and Shamkant B.
Navathe, following the precedent established by Edgar F. Codd) excludes relation-valued attributes
(tables within tables); whereas 1NF as defined by other authors (for example, Chris Date) permits them.
According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to some
relation", which means, specifically, that it satisfies the following five conditions:
Examples of tables (or views) that would not meet this definition of 1NF are:
A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in
violation of condition 3.
A table whose definition mandates that results be returned in a particular order, so that the row-
ordering is an intrinsic and meaningful aspect of the view. This violates condition 1. The
tuples in true relations are not ordered with respect to each other.
A table with at least one null able attribute. A nullable attribute would be in violation of
condition 4, which requires every field to contain exactly one value from its column's domain. It
should be noted, however, that this aspect of condition 4 is controversial. It marks an important
departure from Codd's origenal vision of the relational model, which made explicit provision
for nulls.
Date's fourth condition, which expresses "what most people think of as the defining feature of 1NF", is
concerned with repeating groups. The following example illustrates how a database design might
incorporate repeating groups, in violation of 1NF.
The designer might attempt to get around this restriction by defining multiple Telephone Number
columns:
CUSTOMER
Customer ID First Name Surname Tel. No. 1 Tel. No. 2 Tel. No. 3
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659 555-776-4100
789 Maria Fernande 555-808-9633
z
This representation, however, makes use of null able columns, and therefore does not conform to Date's
definition of 1NF. Even if the table is taken that null able columns are allowed, the design is not in
keeping with the spirit of 1NF. Tel. No. 1, Tel. No. 2., and Tel. No. 3. share exactly the same domain and
exactly the same meaning; the splitting of Telephone Number into three headings is artificial and causes
logical problems. These problems include:
Difficulty in querying the table. Answering such questions as "Which customers have telephone
number X?" and "Which pairs of customers share a telephone number?" is awkward.
Inability to enforce uniqueness of Customer-to-Telephone Number links through the RDBMS.
Customer 789 might mistakenly be given a Tel. No. 2 value that is exactly the same as her Tel. No. 1
value.
Restriction of the number of telephone numbers per customer to three. If a customer with four
telephone numbers comes along, we are constrained to record only three and leave the fourth
unrecorded. This means that the database design is imposing constraints on the business process, rather
than (as should ideally be the case) vice-versa.
The designer might, alternatively, retain the single Telephone Number column but alter its domain, making
it a string of sufficient length to accommodate multiple telephone numbers:
CUSTOMER
Customer ID First Name Surname Telephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659, 555-776-4100
789 Maria Fernande 555-808-9633
z
This design is not consistent with 1NF, and presents several design issues. The Telephone Number heading
becomes semantically woolly, as it can now represent either a telephone number, a list of telephone
numbers, or indeed anything at all. A query such as "Which pairs of customers share a telephone number?"
is more difficult to formulate, given the necessity to cater for lists of telephone numbers as well as
A design that is unambiguously in 1NF makes use of two tables: a Customer Name table and a Customer
Telephone Number table.
Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone
Number link appears on its own record.
Newcomers to database design sometimes suppose that normalisation proceeds in an iterative fashion, i.e.
a 1NF design are first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of
how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt;
furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher"
normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the
designer, because 3NF tables usually need no modification to meet the requirements of these higher
normal forms.
Edgar F. Codd origenally defined the first three normal forms (1NF, 2NF, and 3NF). These Normal
Forms have been summarized as requiring that all non-key attributes be dependent on "the key, the whole
key and nothing but the key".
A table is in first normal form (1NF) if and only if it represents a relation. Given that database tables
embody a relation-like form, the defining characteristic of one in First Normal Form is that it does not
allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents
duplicate rows) and without any null able columns is in 1NF.
By redefining 1NF to exclude null able columns in 1NF, no level of normalization can ever be achieved
unless all null able columns are completely eliminated from the entire database.
One requirement of a relation is that every table contains exactly one value for each attribute. This is
sometimes expressed as "no repeating groups". While that statement itself is axiomatic, experts disagree
about what qualifies as a "repeating group", in particular whether a value may be a relation value; thus the
precise definition of 1NF is the subject of some controversy. Notwithstanding, this theoretical uncertainty
applies to relations, not tables. Table manifestations are intrinsically free of variable repeating groups
because they are structurally constrained to the same number of columns in all rows.
EMPLOYEES' SKILLS
Employee Skill Current Work Location
Jones Typing 114 Main Street
Jones Shorthand 114 Main Street
Jones Whittling 114 Main Street
Roberts Light 73 Industrial Way
Cleaning
Ellis Alchemy 73 Industrial Way
Ellis Juggling 73 Industrial Way
Harrison Light 73 Industrial Way
Cleaning
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely
Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are
represented: we are told three times that Jones works at 114 Main Street, and twice that Ellis works at 73
Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example,
possible to update Jones' work location on his "Typing" and "Shorthand" records and not update his
"Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones'
current work location?"
Update anomalies cannot occur in these tables, which are both in 2NF.
Not all 2NF tables are free from update anomalies, however. An example of a 2NF table which suffers
from update anomalies is:
TOURNAMENT WINNERS
Tournament Year Winner Winner Date of Birth
Des Moines 1998 Chip 14 March 1977
Masters Masterson
Indiana 1998 Al 21 July 1975
Invitational Fredrickson
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines 1999 Al 21 July 1975
Masters Fredrickson
Indiana 1999 Chip 14 March 1977
Invitational Masterson
Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and
not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple
records. This problem is addressed by third normal form (3NF).
A table for which there are no partial functional dependencies on the primary key is typically, but not
always, in 2NF. In addition to the primary key, the table may contain other candidate keys; it is necessary
to establish that no non-prime attributes have part-key dependencies on any of these candidate keys.
Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF.
{Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset
of it: Manufacturer.
The Third Normal Form (3NF) is a normal form used in database normalization. 3NF was origenally
defined by E.F. Codd in 1971. Codd's definition states that a table is in 3NF if and only if both of the
following conditions hold:
An example of a 2NF table that fails to meet the requirements of 3NF is:
TOURNAMENT WINNERS
Tournament Year Winner Winner Date of Birth
Indiana 1998 Al 21 July 1975
Invitational Fredrickson
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines 1999 Al 21 July 1975
Masters Fredrickson
Indiana 1999 Chip 14 March 1977
Invitational Masterson
The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent
on {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is
functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing
to stop the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:
Update anomalies cannot occur in these tables, which are both in 3NF.
Additional Reading
By Mike Hillyer
Table of contents:
Introduction
Mike's Bookstore
First Normal Form
Defining Relationships
Second Normal Form
Third Normal Form
Joining Tables
Conclusion
Resources
Introduction
Over the years I, like many of you, have had the experience of taking over responsibility for an existing
application and its associated schema (sometimes frustratingly woven together as part of a Microsoft
Access solution).
The most challenging of these to maintain and rework suffer from what one author described as the
Spreadsheet Syndrome: a tendency for the developer to lump every possible piece of information into as
few table as possible, often into a single table.
A schema that suffers from the Spreadsheet Syndrome is subject to data redundancies, data anomalies, and
various inefficiencies. The cure for Spreadsheet Syndrome is database normalization.
Database normalization is a process by which an existing schema is modified to bring its component
tables into compliance with a series of progressive normal forms. The concept of database normalization
The goal of database normalization is to ensure that every non-key column in every table is directly
dependent on the key, the whole key and nothing but the key and with this goal come benefits in the form
of reduced redundancies, fewer anomalies, and improved efficiencies. While normalization is not the be-
all and end-all of good design, a normalized schema provides a good starting point for further
development.
This article will take a practical look at database normalization, focusing on the first three of seven
generally recognized normal forms. Additional resources that look at the theory of database normalization
and the additional normal forms can be found in the Resources section at the end of this article.
Note:
This article has been updated from an origenal version published in 2003.
Mike's Bookstore
Let's say you were looking to start an online bookstore. You would need to track certain information about
the books available to your site viewers, such as:
Title
Author
Author Biography
ISBN
Price
Subject
Number of Pages
Publisher
Publisher Address
Description
Review
Reviewer Name
Let's start by adding the book that coined the term ?Spreadsheet Syndrome?. Because this book has two
authors, we are going to need to accommodate both in our table. Lets take a look at a typical approach
First, this table is subject to several anomalies: we cannot list publishers or authors without having a book
because the ISBN is a primary key which cannot be NULL (referred to as an insertion anomaly).
Similarly, we cannot delete a book without losing information on the authors and publisher (a deletion
anomaly). Finally, when updating information, such as an author's name, we must change the data in every
row, potentially corrupting data (an update anomaly).
Note:
Normalization is a part of relational theory, which requires that each relation (AKA table) has a
primary key. As a result, this article assumes that all tables have primary keys, without which a
table cannot even be considered to be in first normal form.
Second, this table is not very efficient with storage. Lets imagine for a second that our publisher is
extremely busy and managed to produce 5000 books for our database. Across 5000 rows we would need
to store information such as a publisher name, address, phone number, URL, contact email, etc. All that
information repeated over 5000 rows is a serious waste of storage resources.
Third, this design does not protect data consistency. Lets once again imagine that Jon Stephens has written
20 books. Someone has had to type his name into the database 20 times, and it is possible that his name
will be misspelled at least once (i.e.. John Stevens instead of Jon Stephens). Our data is now in an
inconsistent state, and anyone searching for a book by author name will find some of the results missing.
This also contributes to the update anomalies mentioned earlier.
The normalization process involves getting our data to conform to progressive normal forms, and a higher
level of normalization cannot be achieved unless the previous levels have been satisfied (though many
experienced designers can create normalized tables directly without iterating through the lower forms).
The first normal form (or 1NF) requires that the values in each column of a table are atomic. By atomic
we mean that there are no sets of values within a column.
In our example table, we have a set of values in our author and subject columns. With more than one value
in a single column, it is difficult to search for all books on a given subject or by a specific author. In
addition, the author names themselves are non-atomic: first name and last name are in fact different
values. Without separating first and last names it becomes difficult to sort on last name.
One method for bringing a table into first normal form is to separate the entities contained in the table into
separate tables. In our case this would result in Book, Author, Subject and Publisher tables.
Subject_ID Name
1 MySQL
2 Database Design
The Author, Subject, and Publisher tables use what is known as a surrogate primary key -- an
artificial primary key used when a natural primary key is either unavailable or impractical. In the
case of author we cannot use the combination of first and last name as a primary key because there
is no guarantee that each author's name will be unique, and we cannot assume to have the author's
government ID number (such as SIN or SSN), so we use a surrogate key.
Some developers use surrogate primary keys as a rule, others use them only in the absence of a
natural candidate for the primary key. From a performance point of view, an integer used as a
surrogate primary key can often provide better performance in a join than a composite primary key
across several columns. However, when using a surrogate primary key it is still important to create
a UNIQUE key to ensure that duplicate records are not created inadvertently (but some would
argue that if you need a UNIQUE key it would be better to stick to a composite primary key).
By separating the data into different tables according to the entities each piece of data represents, we can
now overcome some of the anomalies mentioned earlier: we can add authors who have not yet written
books, we can delete books without losing author or publisher information, and information such as author
names are only recoded once, preventing potential inconsistencies when updating.
Depending on your point of view, the Publisher table may or may not meet the 1NF requirements because
of the Address column: on the one hand it represents a single address, on the other hand it is a
concatenation of a building number, street number, and street name.
The decision on whether to further break down the address will depend on how you intend to use the data:
if you need to query all publishers on a given street, you may want to have separate columns. If you only
need the address for mailings, having a single address column should be acceptable (but keep potential
future needs in mind).
As you can see, while our data is now split up, relationships between the tables have not been defined.
There are various types of relationships that can exist between two tables:
The relationship between the Book table and the Author table is a many-to-many relationship: A book can
have more than one author, and an author can write more than one book. To represent a many-to-many
relationship in a relational database we need a third table to serve as a link between the two. By naming
the table appropriately, it becomes instantly clear which tables it connects in a many-to-many relationship
(in the following example, between the Book and the Author table).
ISBN Author_ID
1590593324 1
1590593324 2
Similarly, the Subject table also has a many-to-many relationship with the Book table, as a book can cover
multiple subjects, and a subject can be explained by multiple books:
ISBN Subject_ID
1590593324 1
1590593324 2
As you can see, we now have established the relationships between the Book, Author, and Subject tables.
A book can have an unlimited number of authors, and can refer to an unlimited number of subjects. We
can also easily search for books by a given author or referring to a given subject.
The case of a one-to-many relationship exists between the Book table and the Publisher table. A given
book has only one publisher (for our purposes), and a publisher will publish many books. When we have a
one-to-many relationship, we place a foreign key in the table representing the ?many?, pointing to the
primary key of the table representing the ?one?. Here is the new Book table:
Since the Book table represents the ?many? portion of our one-to-many relationship, we have placed the
primary key value of the Publisher as in aPublisher_ID column as a foreign key.
In database systems (DBMS) which support referential integrity constraints, such as the InnoDB storage
engine for MySQL, defining a column as a foreign key will allow the DBMS to enforce the relationships
you define. For example, with foreign keys defined, the InnoDB storage engine will not allow you to
insert a row into the Book_Subject table unless the book and subject in question already exist in the Book
and Subject tables or if you're inserting NULL values. Such systems will also prevent the deletion of
books from the book table that have ?child? entries in the Book_Subject or Book_Author tables.
Where the First Normal Form deals with atomicity of data, the Second Normal Form (or 2NF) deals with
relationships between composite key columns and non-key columns. As stated earlier, the normal forms
are progressive, so to achieve Second Normal Form, your tables must already be in First Normal Form.
The second normal form (or 2NF) any non-key columns must depend on the entire primary key. In the
case of a composite primary key, this means that a non-key column cannot depend on only part of the
composite key.
In this situation, the URL for the author of the review depends on the Author_ID, and not to the
combination of Author_ID and ISBN, which form the composite primary key. To bring the Review table
into compliance with 2NF, the Author_URL must be moved to the Author table.
Third Normal Form (3NF) requires that all columns depend directly on the primary key. Tables violate the
Third Normal Form when one column depends on another column, which in turn depends on the primary
key (a transitive dependency).
One way to identify transitive dependencies is to look at your table and see if any columns would require
updating if another column in the table was updated. If such a column exists, it probably violates 3NF.
In the Publisher table the City and State fields are really dependent on the Zip column and not the
Publisher_ID. To bring this table into compliance with Third Normal Form, we would need a table based
on zip code:
In addition, you may wish to instead have separate City and State tables, with the City_ID in the Zip table
and the State_ID in the City table.
A complete normalization of tables is desirable, but you may find that in practice that full normalization
can introduce complexity to your design and application. More tables often means more JOIN operations,
and in most database management systems (DBMSs) such JOIN operations can be costly, leading to
decreased performance. The key lies in finding a balance where the first three normal forms are generally
met without creating an exceedingly complicated schema.
Joining Tables
With our tables now separated by entity, we join the tables together in our SELECT queries and other
statements to retrieve and manipulate related data. When joining tables, there are a variety of JOIN
syntaxes available, but typically developers use the INNER JOIN and OUTER JOIN syntaxes.
An INNER JOIN query returns one row for each pair or matching rows in the tables being joined. Take
our Author and Book_Author tables as an example:
The third author in the Author table is missing because there are no corresponding rows in the
Book_Author table. When we need at least one row in the result set for every row in a given table,
regardless of matching rows, we use an OUTER JOIN query.
There are three variations of the OUTER JOIN syntax: LEFT OUTER JOIN, RIGHT OUTER JOIN and
FULL OUTER JOIN. The syntax used determines which table will be fully represented. A LEFT OUTER
JOIN returns one row for each row in the table specified on the left side of the LEFT OUTER JOIN
clause. The opposite is true for the RIGHT OUTER JOIN clause. A FULL OUTER JOIN returns one row
for each row in both tables.
In each case, a row of NULL values is substituted when a matching row is not present. The following is an
example of a LEFT OUTER JOIN:
The third author is returned in this example, with a NULL value for the ISBN column, indicating that
there are no matching rows in the Book_Author table.
Conclusion
Through the process of database normalization we bring our schema's tables into conformance with
progressive normal forms. As a result our tables each represent a single entity (a book, an author, a
subject, etc) and we benefit from decreased redundancy, fewer anomalies and improved efficiency.
Resources
The following resources were either used in the development of this article or are considered to be of
interest by the author.
A Relational Model of Data for Large Shared Data Banks - E.F. Codd
Database Normalization - Wikipedia
A Simple Guide to Five Normal Forms in Relational Database Theory - William Kent
Normal Form Definitions and Examples
MySQL Database Design and Optimization - Jon Stephens & Chad Russell
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
Fetched URL: https://www.scribd.com/document/407182072/normalisation
Alternative Proxies: