In discussing database management, it is useful to be able to make distinctions among several types of keys that can occur in tables (scil., relations or files). Unfortunately, it is also easily possible to become somewhat confused by the various types of keys and the various definitions that different writers on database management use for the different types of keys. In this handout we are going to explore the different kinds of keys and the terminology connected with them.
The term superkey is defined by Rob and Coronel (Endnote 1) by saying that "a superkey is any key that identifies each entity [scil., row, record, or tuple] uniquely. In short, the superkey functionally determines all of the entity's attributes" (scil., columns or fields). Hansen and Hansen (Endnote 2) provide another definition: "Any set of attributes that uniquely identifies each tuple in a relation is termed a 'superkey'."
Not all writers in the field of database management are so precise as the two pairs just quoted. For example, Kroenke (Endnote 3) says simply that "the term 'key' means one or more columns that uniquely identify a row in a relation"; more formally, he defines "key" as a "[g]roup of one or more attributes that uniquely defines a tuple in a relation." In doing so, Kroenke is like many writers on database management in using the term "key" to refer in general to superkeys, candidate keys, primary keys, and even secondary keys, terms which are distinguished by some writers, including Rob and Coronel and the Hansens. (Unfortunately, terminology in the database-management field is not yet as well established as one would like it to be.)
The term candidate key is defined by Rob and Coronel as a "superkey without redundancies," i.e., a superkey that lacks a subset of components that by itself provides unique identification of each entity. Thus for Rob and Coronel, a candidate key is what we might call a "minimal superkey."
I think it is fair to say that in limiting the term "candidate key" to mean only minimal superkeys, Rob and Coronel are being fussier than most writers. Usually, the term "candidate key" is used to mean any combination of attributes that (1) uniquely identify each row in the table being discussed and (2) have no null values. That is, most people (including me) use the term "candidate key" to include both of the two concepts, "superkey" and "candidate key," that Rob and Coronel distinguish in their strict definitions.
For example, let us look at how Hansen and Hansen treat these terms. First, we note that they define a key as a "minimal set of attributes that uniquely identifies each row in a relation"; they imply, but do not explicitly mention, the requirement of no null values among these attributes. Then they define candidate key by saying, "In any given relation, there may be more than one set of attributes that could be chosen as a key. These are called 'candidate keys'." They also express the idea of "candidate key" as any "set of attributes that could be chosen as a key of a relation." We can note, however, that these definitions do not really distinguish the Hansens' meaning of "candidate key" from their meaning of "superkey."
The term primary key is defined by Rob and Coronel as a "candidate key selected to uniquely identify all other attribute values in a given row. [A primary key c]annot contain null entries." Hansen and Hansen say that when "one of the candidate keys is selected as the relation key, it may be called the 'primary key'." They also define the primary key as the "candidate key designated for principal use in uniquely identifying rows in a relation."
Both of these definitions convey the idea that a relation may often have more than one possible primary key. Indeed, the final decision on which attribute (or set of attributes) is to be treated as the primary key in a particular relation may be a somewhat arbitrary choice. In other words, the choice of what will be the primary key may depend on factors external to the table, such as the convenience of the database designers or the database users (provided, of course, that what is chosen as the primary key meets the condition of being a unique key for the table).
There are some more definitions of types of keys to be discussed, but let us first provide some examples of what we have talked about thus far.
We
can illustrate the ideas by the following example. Here is a table that we shall
call CUSTOMER:
CUSTOMER
| CUS_NUM | CUS_LNAME | CUS_FNAME | CUS_INIT | CUS_DOB | REP_NUM |
| 10008 | Simmons | Jill | H. | 08/12/37 | 37 |
| 10009 | Barker | John | W. | 11/23/52 | |
| 10010 | Ortega | Roberto | A. | 02/19/63 | 14 |
| 10011 | Williams | June | K. | 12/15/52 | 21 |
This table has a number of superkeys. First, we note that the REP_NUM attribute has a null value and, hence, cannot be part of any superkey. The other attributes have no null values and no duplicate values in this table as presented (because, for these other attributes, nulls and duplicates were avoided by design in this example). Hence, with the exception of the REP_NUM attribute, each attribute and all combinations of attributes possess the possibility of forming parts of superkeys. In fact, in the table as presented the various superkeys are the following (the groupings into singletons, pairs, triples, etc., of attributes are arbitrary and are intended only to aid readability):
CUS_NUM
CUS_LNAME
CUS_FNAME
CUS_INIT
CUS_DOB
CUS_NUM
and CUS_LNAME
CUS_NUM and CUS_FNAME
CUS_NUM and CUS_INIT
CUS_NUM and CUS_DOB
CUS_LNAME
and CUS_FNAME
CUS_LNAME and CUS_INIT
CUS_LNAME and CUS_DOB
CUS_FNAME and CUS_INIT
CUS_FNAME and CUS_DOB
CUS_INIT and CUS_DOB
CUS_NUM,
CUS_LNAME, and CUS_FNAME
CUS_NUM, CUS_LNAME, and CUS_INIT
CUS_NUM, CUS_LNAME, and CUS_DOB
CUS_NUM, CUS_FNAME,
and CUS_INIT
CUS_NUM, CUS_FNAME, and CUS_DOB
CUS_NUM, CUS_INIT, and CUS_DOB
CUS_LNAME,
CUS_FNAME, and CUS_INIT
CUS_LNAME, CUS_FNAME, and CUS_DOB
CUS_LNAME, CUS_INIT, and CUS_DOB
CUS_FNAME, CUS_INIT, and CUS_DOB
CUS_NUM,
CUS_LNAME, CUS_FNAME, and CUS_INIT
CUS_NUM, CUS_LNAME, CUS_FNAME, and CUS_DOB
CUS_NUM, CUS_FNAME,
CUS_INIT, and CUS_DOB
CUS_NUM, CUS_LNAME, CUS_INIT, and CUS_DOB
CUS_LNAME, CUS_FNAME, CUS_INIT, and CUS_DOB
CUS_NUM, CUS_LNAME, CUS_FNAME, CUS_INIT, and CUS_DOB
All 31 of the above single attributes and combinations of attributes are superkeys for the CUSTOMER table. That is, each of the above 31 single attributes and combinations of attributes provides a unique identifier for each row in the table, thus satisfying the definition of superkey.
What are the candidate keys in the CUSTOMER table as presented? Because of Rob and Coronel's requirement of minimality in their definition of "candidate key," they would accept as candidate keys only the five single attributes: viz., CUS_NUM, CUS_LNAME, CUS_FNAME, CUS_INIT, and CUS_DOB. Most writers, however, would call all 31 combinations "candidate keys" for the table as presented.
(You may already be wondering which of the single-attribute candidate keys would be the most sensible choice for primary key of this table? If you think about it for a moment, you will realize that the CUS_LNAME, CUS_FNAME, CUS_INIT, and CUS_DOB attributes all share the liability that it would be quite easy for them to have duplicates if more rows were added to the table. Only the CUS_NUM attribute can readily be guaranteed to be unique, because whenever a new customer needs to be added to the table, you can choose a non-duplicative number for him or her. We will come back to this matter shortly.)
By the way, we can note that the only reason why the REP_NUM column was excluded from consideration in the foregoing is that it has a null value in the CUSTOMER table, as presented. If the table had had, say, value 35 as the REP_NUM for the row for John Barker, then we would have had to include this column in the working out of the possible superkeys. (In this case, if you undertake to work out all the possible superkeys, you will find that there are 63 of them, formed by all possible combinations of the six columns taken one at a time, two at a time, and so on, up to six at a time.) On the other hand, if Barker's REP_NUM had been, say, 21, then the REP_NUM columns would still have been excluded, this time for lack of uniqueness.
We have
been talking about the CUSTOMER table as presented above, i.e., with only
four rows of data. At this point we need to consider what the effects might be
of data in additional rows in the CUSTOMER table. As noted earlier, it is clear
that, with additional rows, it would be easy for duplicate values to occur in
the CUS_LNAME, CUS_FNAME, CUS_INIT, and CUS_DOB columns and, as already mentioned,
in the REP_NUM column. For example, suppose we add two rows to the table, as follows:
CUSTOMER
| CUS_NUM | CUS_LNAME | CUS_FNAME | CUS_INIT | CUS_DOB | REP_NUM |
| 10008 | Simmons | Jill | H. | 08/12/37 | 37 |
| 10009 | Barker | John | W. | 11/23/52 | |
| 10010 | Ortega | Roberto | A. | 02/19/63 | 14 |
| 10011 | Williams | June | K. | 12/15/52 | 21 |
| 10012 | Doe | John | E. | 01/30/60 | 35 |
| 10013 | Roe | Mary | A. | 07/04/65 | 21 |
Then the CUS_FNAME and CUS_INIT columns no longer contain unique values in each row, and hence can no longer be included in the superkeys and candidate keys. From the preceding list of superkeys and candidate keys we must delete every line that contains either CUS_FNAME or CUS_INIT. To put it another way, now only the CUS_NUM, CUS_LNAME, and CUS_DOB columns can participate in the formation of superkeys. This leaves us with the following much shorter list:
CUS_NUM
CUS_LNAME
CUS_DOB
CUS_NUM
and CUS_LNAME
CUS_NUM and CUS_DOB
CUS_LNAME and CUS_DOB
CUS_NUM, CUS_LNAME, and CUS_DOB
Of these superkeys, Rob and Coronel wouldbecause of their insistence on minimalityaccept only the single attributes CUS_NUM, CUS_LNAME, and CUS_DOB as candidate keys.
It is clear that with still more rows added to the CUSTOMER table, it would be quite possibleindeed, likelythat there would be duplicate values in the CUS_LNAME and CUS_DOB columns. In other words, in general these columns cannot be relied on to contain unique values for even a moderately large set of customers. Thus, as a practical matter the CUSTOMER table contains only one real superkey and one real candidate key, the CUS_NUM column. This column, which presumably contains unique values by design, i.e., because each customer is deliberately and carefully assigned a unique identifying number, is the obvious only choice to be the primary key of the CUSTOMER table.
Now we resume our discussion of the definitions of various types of keys. First, we note that the keys defined thus far (viz., superkeys, candidate keys, and primary keys) can be called "true" keys, in that they must have unique values (or unique sets of values) in a table and must also not contain any null values in the table. We are about to discuss two extensions of the concept of key: viz., "secondary keys" and "foreign keys." These extensions fail, in general, to be true keys in the sense just noted.
The term secondary key is defined by Rob and Coronel as "a key that is used strictly for data-retrieval purposes." Hansen and Hansen define a secondary key as a "data item value that identifies a set of records." This definition contrasts usefully with their definition of primary key (which they provide as an alternative to the one quoted earlier): viz., a "data item value that uniquely identifies a record." It is important to note that a secondary key does not need to have unique values in a table; in this respect, secondary keys differ from primary keys (and candidate keys and superkeys). Kroenke emphasizes this by saying that sometimes "applications need to access and process records by a secondary key, one that is different from the primary key. Secondary keys might be unique (such as a professor's name), or they might be non-unique (such as a customer's zip code)."
Though not quite identical, these definitions share the characteristic of labeling as a secondary key a column (or columns) that is not a primary key but is used to retrieve records. For example, it seems clear that we would be likely to want to retrieve, from the CUSTOMER table, records that contained a specified value, e.g., Roe, in the CUS_LNAME column. That is to say, we are likely to have occasion to retrieve the records for all customers with a certain last name. In similar fashion, we are likely to have occasion to retrieve the records for all customers with whom a certain sales representative has responsibility for maintaining contact. Thus, the CUS_LNAME and REP_NUM columns in the CUSTOMER table as presented by Rob and Coronel are plausible secondary keys.
Note, however, that in the expanded CUSTOMER table that I presented above, the REP_NUM column has a duplicate value, which prevents it from serving as a primary key or as a higher-level key. In general, both of these attributes would, in still larger tables, be likely to contain duplicate values, which would prevent them from serving as primary or higher-level keys.
The term foreign key is defined by Rob and Coronel as an "attribute (or combination of attributes) in one table whose values must either match the primary key in another table or be null." Hansen and Hansen define a foreign key as a "set of attributes in one relation that constitute a key in some other . . . relation [and that is] used to indicate logical links between relations." By "key" Hansen and Hansen mean here a column (or set of columns) that is a primary or a candidate key, i.e., that has unique values, in the other relation. Kroenke puts the matter more simply, by saying that when "the key of one relation is stored in another relation, it is sometimes called a 'foreign key'" in the latter relation. McFadden and Hoffer (Endnote 4) write that a "'foreign key' is an attribute that appears as a nonkey attribute in one relation . . . and as a primary key attribute (or part of a primary key) in another relation." As with secondary keys, foreign keys need not be unique in a given table.
One further comment on the terms "secondary key" and "foreign key": These terms are sometimes used synonymously in non-rigorous talk about tables and DBMSs. However, in rigorous discussions: the term "secondary key" can be applied to a column (or set of columns) whose values are not those of a candidate-key or primary-key column (or set of columns) in another table; but the term "foreign key" can be applied only to a column (or set of columns) whose values are those of a primary key in another table. Thus, in rigorous discussions, all foreign keys are also secondary keys, but not all secondary keys are also foreign keys.
We
can illustrate the notion of a foreign key by a table that we shall call SALESREP:
SALESREP
| REP_NUM | REP_ACODE | REP_PHONE | REP_LNAME | REP_SALES |
| 24 | 615 | 8982232 | Poindextrus | 231239.96 |
| 37 | 901 | 6752231 | Goffenbarger | 143578.34 |
| 14 | 615 | 8982233 | Rather | 99231.86 |
| 35 | 615 | 8904536 | Smith | 314678.27 |
| 21 | 615 | 8982231 | Talladegas | 400145.02 |
Clearly, the primary key in this table is intended to be REP_NUM. This column also occurs in the CUSTOMER table, where it constitutes a foreign key and provides the bridge between the two tables.
Note that the REP_NUM column is not a true key in the CUSTOMER table; i.e., it is a nonkey attribute in this table. This is because in the CUSTOMER table REP_NUM fails the two tests of a true key: viz., that it have (1) unique values and (2) no null values. In the CUSTOMER table as presented in Rob and Coronel, the REP_NUM has a null value. Also, as we have illustrated in expanding the CUSTOMER table to six rows of data, in general the REP_NUM column will have duplicate values in the CUSTOMER table, reflecting the fact that in the real world many customers share one sales representative.
But the REP_NUM column is a true key in the SALESREP table, where it presumably has unique values by design and by careful assignment of values. And, as a foreign key in the CUSTOMER table, the REP_NUM column is an indispensable link between the two tables.
1. Rob, Peter; Coronel, Carlos. Database Systems: Design, Implementation, and Management. 4th edition. Cambridge, MA: Course Technology; 2000. 750 p. ISBN:0-7600-1090-0.
2. Hansen, Gary W.; Hansen, James V. Database Management and Design. 548 p. Englewood Cliffs, NJ: Prentice-Hall; 1992. ISBN: 0-13-200759-2.
3. Kroenke, David M. Database Processing: Fundamentals, Design, Implementation. 659 p. New York, NY: Macmillan; 1992. ISBN:0-02-366875-X.
4. McFadden, Fred R.; Hoffer, Jeffrey A. Modern Database Management. 4th edition. 597 p. Redwood City, CA: Benjamin/Cummings; 1994. ISBN:0-8053-6047-6.
Go to Course Description: Database-Management Principles and Applications.
Go to Course Schedule: Database-Management Principles and Applications.
Go to Guide to Course Materials for LIS 384K.11.
Go to Wyllys Webpage.
.
Last revised 2004 Feb 23