One of the more complicated topics in the area of database management is the process of normalizing the tables in a relational database. These notes are intended to provide you with an overview of this topic, which I hope will be helpful to you after you have gained some familiarity with the ideas of, and techniques used in, normalization.
The underlying ideas in normalization are simple enough. Through normalization we want to design for our relational database a set of files that (1) contain all the data necessary for the purposes that the database is to serve, (2) have as little redundancy as possible, (3) accommodate multiple values for types of data that require them, (4) permit efficient updates of the data in the database, and (5) avoid the danger of losing data unknowingly.
The primary reason for normalizing databases to at least the level of the 3rd Normal Form (the levels are explained below) is that normalization is a potent weapon against the possible corruption of databases stemming from what are called "insertion anomalies," "deletion anomalies," and "update anomalies." These types of error can creep into databases that are insufficiently normalized.
An "insertion anomaly" is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place, and, human fallibility being what it is, some of the needed additional insertions may be missed.
A "deletion anomaly" is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed.
An update of a database involves modifications that may be additions, deletions, or both. Thus "update anomalies" can be either of the kinds of anomalies discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalized databases are much less susceptible to corruption than are unnormalized databases.
Normalization can be viewed as a series of steps (i.e., levels) designed, one after another, to deal with ways in which tables can be "too complicated for their own good". The purpose of normalization is to reduce the chances for anomalies to occur in a database. The definitions of the various levels of normalization illustrate complications to be eliminated in order to reduce the chances of anomalies. At all levels and in every case of a table with a complication, the resolution of the problem turns out to be the establishment of two or more simpler tables which, as a group, contain the same information as the original table but which, because of their simpler individual structures, lack the complication.
In practice, accomplishing normalization is often fairly simple. Confining the entries in any one table to data related to a single set of related attributeswhat I like to call "single-theme tables"will usually do the job. By a "single-theme table" I mean a table that concentrates either on one concept (i.e., typically, one entity) in the situation or on one relationship in the situation. The examples later in this lesson concern a hypothetical discussion of how to set up a database dealing with puppies, kennels, and tricks performed by the puppies. In terms of these examples, you will see single-theme tables dealing with one concept, (e.g., with just puppies and their names, with just tricks and the names of tricks) and with one relationship (e.g., pairings of puppies and tricks). You will also see some tables that are not single-theme tables, and you will see some of the problems that ensue from their failing to concentrate on a single theme.
If, instead of using the single-theme approach, you set out to normalize the tables in a database via a definitional approach (i.e., carefully examining tables in terms of the definitions of the various levels of normal forms), you may encounter some difficulty, or at least some tedium, in achieving complete surety that you have achieved a high level of normalization. The formal rules that follow provide a summary of the normalization process, but you will need to study them carefully and to work through several examples before you can start to feel comfortable in your understanding of normalization. Here are the formal rules of normalization, presented primarily for reference and as a summary, useful after you have learned about normalization through more detailed discussions elsewhere.
Note: The examples below (in the section entitled "An Outline of Normalization by Marc Rettig," following the section on "Formal Definitions of the Normal Forms") are intended to present a simplified introduction to the ideas of, and arguments for, normalization of databases. For a more detailed discussion of normalization, one source is my lesson entitled Steps in Normalization.
Def: A table (relation) is in 1NF if
1. There are no duplicated rows in the table.
2. Each cell is single-valued (i.e., there are no repeating groups or arrays).
3. Entries in a column (attribute, field) are of the same kind.
Note: The order of the rows is immaterial; the order of the columns is immaterial.
Note: The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one columneven, possibly, of all the columns).
Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of the key.
Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if it has no partial dependencies."
Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies.
Def: A table is in BCNF if it is in 3NF and if every determinant is a candidate key.
Def: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.
Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and if every join dependency in the table is a consequence of the candidate keys of the table.
Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.
Here is an outline of the ideas of normalization. It is adapted from a presentation prepared, in the form of an attractive and information-rich poster, by Marc Rettig, Technical Editor, Database Programming & Design magazine. He uses examples from a hypothetical discussion of how to set up a database dealing with puppies, kennels, and tricks performed by the puppies. The words that follow are Mr. Rettig's, except for my comments in square brackets.
Rules of Normalization
Rule 1. Eliminate Repeating Groups. Make a separate table for each set of related attributes, and give each table a primary key.
| Unnormalized Data Items for Puppies Puppy
number |
In the original list of data, each puppy's description is followed by a list of tricks the puppy has learned. Some might know ten tricks, some might not know any. To answer the question, "Can Fifi roll over?", we need first to find Fifi's puppy record, then scan the list of tricks asociated with the record. This is awkward, inefficient, and extremely untidy.
Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks from the puppy information results in first normal form. The puppy number in the trick table matches the primary key in the puppy table, providing a foreign key for relating the two tables with a join operation. Now we can answer our question with a direct retrieval: look to see if Fifi's puppy number and the trick ID for "roll over" appear together in the trick table.
| FIRST NORMAL
FORM Puppy Table Trick
Table |
[Note that Rettig's Rule 1, "Eliminate Repeating Groups," has the effect of ensuring that the tables satisfy the conditions on what we have called a "well-formed" table in a relational database; for well-formed tables are ones that satisfy the conditions of the 1st Normal Form. Note also that in the Trick Table, the primary key is a multi-valued, or "composite," key, which consists of both Puppy Number and Trick ID.]
Rule 2. Eliminate Redundant Data. If an attribute depends on only part of a multi-valued key, remove it to a separate table.
| TRICK TABLE | ||||
| Puppy Number | Trick ID | Trick Name | Where Learned | Skill Level |
| 52 | 27 | Roll Over | 16 | 9 |
| 53 | 16 | Nose Stand | 9 | 9 |
| 54 | 27 | Roll Over | 9 | 5 |
The trick name (e.g., "Roll Over") appears redundantly for every puppy that knows it. Just Trick ID would do.
[Note that Trick Name depends on only a part (the Trick ID) of the multi-valued, i.e., composite, key.]
| SECOND
NORMAL FORM Puppy Table
Tricks
Puppy Tricks |
In the Trick Table, the primary key is made up of the puppy number and the trick ID. This makes sense for the "Where Learned" and "Skill Level" attributes, since they will be different for every puppy-trick combination. But the trick name depends only on the Trick ID. The same name will appear redundantly every time its associated ID appears in the Trick Table.
Suppose you want to reclassify a trick, i.e., to give it a different trick ID. The change has to be made for every puppy that knows the trick! If you miss some of the changes, you will have several puppies with the same trick under different IDs. This is an update anomaly.
Or suppose the last puppy knowing a particular trick gets run over by a car. His records will be removed from the database, and the trick will not be stored anywhere! This is a delete anomaly. To avoid these problems, we need the second normal form.
To achieve this, separate the attributes depending on both parts of the key from those depending only on the Trick ID. This results in two tables: "Tricks," which gives the name for each Trick ID; and "Puppy Tricks," which lists the tricks learned by each puppy.
[Note that Rettig's Rule 2, "Eliminate Redundant Data," is aimed at tables that fail to satisfy the conditions imposed on tables by the definition of 2nd Normal Form. In this example it takes the "Trick" table, which fails to be in 2NF because the Trick Name attribute depends only on the Trick ID attribute (i.e., on just part of the composite key), and turns "Trick" into two tables. Note also that the result is a set of tables that come close to being single-theme tables. "Tricks" is clearly a single-theme table. "Puppy Tricks" is a single-theme table because it pairs puppies and their tricks, i.e., it deals with the relationships between individual puppies and the tricks they know. In "Puppy Tricks," the "Trick Where Learned" attribute is clearly dependent on both parts of the key, since the attribute is based not only on which particular trick is being referred to but also on where the particular puppy learned that trick. The same is true of the "Skill Level" attribute, since this attribute is based not only on which particular trick is being referred to but also on what the particular puppy's level of skill on that trick is.]
Now we can reclassify a trick in a single operation: Look up the Trick ID in the "Tricks" table and change its name. The result will instantly be available throughout the application.
Rule 3. Eliminate Columns Not Dependent on Key. If attributes do not contribute to a description of the key, remove them to a separate table.
| Puppy
Table Puppy Number Puppy Name Kennel Code Kennel Name |
The Puppy Table satisfies the first normal form, since it contains no repeating groups. It satisfies the second normal form, since it does not have a multivalued key. But the key is Puppy Number, and the kennel name and kennel location describe only a kennel, not a puppy. To achieve the third normal form, they must be moved into a separate table. Since they describe a kennel, Kennel Code becomes the key of the new "Kennels" table.
| THIRD
NORMAL FORM Puppies Kennels Tricks Puppy Tricks |
The motivation for this is the same as for the second normal form: We want to avoid update and delete anomalies. For example, suppose no puppies from the Daisy Hill Puppy Farm were currently stored in the database. With the previous design, there would be no record of its existence!
[Note that Rettig's Rule 3, "Eliminate Columns Not Dependent on Key," amounts to removing the transitive dependencies that keep a table from achieving 3rd Normal Form. In the "Puppy" table, attributes Kennel Name and Kennel Location are dependent only on Kennel Code. This situation is superficially similar to that of the dependence on part of a composite key that is the focus of the 2nd Normal Form. However, this situation differs in that here the attribute (viz., Kennel Code) on which other attributes are dependent is not part of a composite key; it is simply one of the non-key attributes in the "Puppy" table. Note also that "Kennels" is clearly a single-theme table. "Puppies" is close to a single-theme table, in that it concentrates on puppies and their numbers. If the real-life situation iswhich is likelythat any given puppy can be identified with one and only one kennel, then we can consider "Puppies" to be a single-theme table. But ifwhich is not likelypuppies sometimes start life in one kennel and are then transferred, as puppies, to another kennel, we would not be able to consider "Puppies" to be single-theme because we would have to make further modifications to handle the mixed origins.]
Third Normal Form is sufficient for most situations. But if that isn't normal enough for you ...
Rule 4. Isolate Independent Multiple Relationships. No table may contain two or more l:n or n:m relationships that are not directly related.
This applies only to designs that include one-to-many and many-to-many relationships. An example of a one-to-many relationship is that one kennel can hold many puppies. An example of a many-to-many relationship is that a puppy can know many tricks and several puppies can know the same trick.
| Puppy Tricks and Costumes Puppy Number Trick ID Trick Where Learned Skill Level Costume |
Suppose we want to add a new attribute to the Puppy-Trick table, "Costume." This way we can look for puppies that can both "sit-up-and-beg" and wear a Groucho Marx mask, for example. The fourth normal form dictates against this (i.e., against using the Puppy-Tricks table, not against begging while wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy may be able to wear a wet suit. This does not mean it can simultaneously sit up and beg. How will you represent this if you store both attributes in the same table?
| FOURTH NORMAL
FORM Puppies Kennels Tricks Puppy Tricks Costumes Puppy
Costumes |
[Note that Rettig's Rule 4, "Isolate Independent Multiple Relationships,"amounts to the elimination of multi-valued dependencies that is the focus of the 4th Normal Form. In table "Puppy Tricks and Costumes," attribute Costume may or may not be dependent on attribute Trick ID as well as on attribute Puppy Number. We remove the ambiguity by constructing two replacement tables, "Costumes" and "Puppy Costumes," each of which is clearly a single-theme table. At left is the 4th Normal Form of the database. It consists of the Third Normal Form tables plus the new tables. Note also that table "Puppy Costumes" has a composite primary key that is made up of two foreign keys, from the "Puppies" and "Costumes" tables, respectively.]
Rule 5. Isolate Semantically Related Multiple Relationships. There may be practical constraints on information that justify separating logically related many-to-many relationships.
Usually, related attributes belong together. For example, if we really wanted to record which tricks each puppy could do in which costume, we would want to keep the Costume attribute in the Puppy-Tricks table. But there are times when special characteristics of the data make it more efficient to separate even logically related attributes.
Imagine that our database will record when breeds are available in each kennel, and which breeder supplies dogs to those kennels. This suggests a Kennel-Breeder-Breeds table, which satisfies fourth normal form. As long as any kennel can supply any breed from any breeder, this works fine.
| Kennel-Breeder-Breeds Kennel Number Breeder Breed |
| Kennel-Breeder-Breeds | ||
| Kennel Number | Breeder | Breed |
| 5 | Acme | Spaniel |
| 5 | Acme | Dachshund |
| 5 | Acme | Banana-Biter |
| 5 | Puppy Factory | Spaniel |
| 5 | Puppy Factory | Dachshund |
| 5 | Puppy Factory | Banana-Biter |
| 5 | Whatapuppy | Spaniel |
| 5 | Whatapuppy | Dachshund |
| 5 | Whatapuppy | Banana-Biter |
Now suppose a law is passed to prevent exclusive arrangements: a kennel selling any breed must offer that breed from all breeders it deals with. In other words, if Khabul Khennels sells Afghans and wants to sell any Daisy Hill puppies, it must sell Daisy Hill Afghans.
The need for fifth normal form becomes clear when we consider inserts and deletes. Suppose that a kennel (whose number in the database happens to be 5) decides to offer three new breeds: Spaniels, Dachshunds, and West Indian Banana-Biters. Suppose further that this kennel already deals with three breeders that can supply those breeds. This will require nine new rows in the database, one for each breeder-and-breed combination.
Breaking up the table reduces the number of inserts to six. Here are the tables necessary for fifth normal form, shown with the six newly inserted rows.
| FIFTH NORMAL
FORM Puppies Kennels Tricks Puppy Tricks Costumes Puppy
Costumes Kennel-Breed Kennel-Breeder |
If an application involves significant update activity, fifth normal form can mean important savings. Note that these combination tables develop naturally out of entity-relationship analysis.
[Note also that each of the tables at the left is clearly a single-theme table.]
| Kennel-Breed | |
| Kennel Number | Breed |
| 5 | Spaniel |
| 5 | Dachshund |
| 5 | Banana-Biter |
| Kennel-Breeder | |
| Kennel Number | Breeder |
| 5 | Acme |
| 5 | Puppy Factory |
| 5 | Whatapuppy |
Go to Course Description: Database-Management
Principles and Applications.
Go to Course Schedule: Database-Management Principles
and Applications.
Go to Guide
to Course Materials for LIS 384K.11.
Go to Wyllys Webpage.
Last revised 2005 Mar 11