THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION


LIS 384K.11 (known as INF 385M, beginning with the Fall Semester 2003)
DATABASE-MANAGEMENT PRINCIPLES AND APPLICATIONS
R. E. Wyllys

Introduction to Table Concepts


Intoduction

This discussion concerns some basic characteristics of tables. It illustrates the concept of data types and of the key (specifically, primary key) of a table. It also shows how flat-file tables can readily run into problems if we try to put too much information into them.

Data Types

Note that the examples of tables in this handout include data of several types.

The Social Security Number (SSN) columns contain digits and hyphens. If we removed the hyphens (which are not really part of a Social Security Number but are simply the conventional way of separating the digits to make the SSN more easily readable by humans), we would have only digits. Nevertheless, there is no conceivable reason why we would ever want to perform arithmetic operations on these digits. Therefore, we should consider the SSN column entries as alphanumeric data, usually called the "character" or "CHAR" data type in database work. In Microsoft Access this type is called "Text".

In contrast to the SSNs, the entries in the Grade-Point Average (GPA) columns should be treated as true numbers, usually called the "numeric" or "NUM" data type in database work. This is because not only are the GPA entries the result of arithmetic operations, but also we may want to perform further arithmetic operations on them; e.g., we might want to find the average of the GPA values in a certain set of rows.

The Birthdate columns are a particular type of numeric data, the "calendar" or "DATE" data type, called "Date/Time" in Microsoft Access. DATE entries are presented in the table in a format that makes it easy for us humans to interpret the values as calendar dates, but the entries are actually stored as integer numbers that represent the number of days between December 31, 1899 and the date presented in the human-readable format. For example, January 1, 1900 is stored as the number "1", and January 1, 1997 is stored as the number "35431", meaning that 35,431 days elapsed between December 31, 1899 and January 1, 1997. (In similar fashion, January 1, 2000 is stored as "36526"; and January 1, 2001, as "36892".)

The First Name (FNAME) and Last Name (LNAME) columns contain data that are mostly alphabetic, although we have to allow for the possibility of digits in name suffixes like "3d" (actually, Bill Gates uses "III"). Such data are easily seen to be of the CHAR type. The Major column also contains CHAR data.

The G? column contains only "Y" and "N" entries, which are to be interpreted as answers, "yes" or "no", to the question, "Is the student represented by this row a Graduate student or not?" Equivalently, these answers could be viewed as truth values, "True" or "False", applied to the statement, "The student represented by this row is a Graduate student." Thus the data in the G? column are of the "logical" or "Boolean" data type, which Microsoft Access calls "Yes/No". UNIQUE IDENTIFIERS OF ROWS In Example A, the Social Security Number (SSN) easily provides a unique identifier for each row in the table. Hence, the SSN column is an obvious choice for the key of this table. It is an example of a "simple", i.e., single-column, key.

Example A

SSN FNAME LNAME MAJOR G? BIRTHDATE GPA
435-69-3739 William Gates 3d Biology N 03/11/73
3.00
454-67-8876 Henry Aldridge Psychology N 06/07/74
1.95
123-45-6789 John Jones LIS N 11/17/71
3.25
667-83-9631 Elmer Gantry Theology N 07/04/72
2.25
222-33-4444 Lynn  Lee LIS Y 10/09/70
3.90
987-65-4321 Mary Ruiz History Y 05/15/69
2.90
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
777-66-5432 John Doe Philosophy N 09/16/73
1.75
543-21-0987 Ada Byron Comp. Sci. N 02/29/72
3.95

Example B is the same as Example A, except that the easy key choice, the SSN, is not available. What can we use for a key in this case? You can easily verify that all of the columns except the last, the GPA column contain duplicate entries and that the GPA column contains unique entries. So far as Example B, in its present state, is concerned, therefore, it would possible to use the GPA column as the key column.

Example B

FNAME LNAME MAJOR G? BIRTHDATE GPA
William Gates 3d Biology N 03/11/73
3.00
Henry Aldridge Psychology N 06/07/74
1.95
John Jones LIS N 11/17/71
3.25
Elmer Gantry Theology N 07/04/72
2.25
Lynn  Lee LIS Y 10/09/70
3.85
Mary Ruiz History Y 05/15/69
2.90
Raquel Ruiz Management N 05/15/69
3.90
John Doe Philosophy N 09/16/73
1.75
Ada Byron Comp. Sci. N 02/29/72
3.95

This leads us to the question, "Would it be wise to use the GPA column as the key?" The answer is "no," for a couple of practical reasons. The first reason is that it is clear that if we added rows to the table (i.e., if we added information about additional students), we would be very likely to encounter at least some duplication of entries for the GPA column. It is clearly unwise to use as a key a column that is highly susceptible to the possibility of containing duplicates. A second reason is that we are unlikely to want ever to look up a student in a table like this by searching on the student's GPA. As a practical matter, it is usually desirable to use as a key a column on which we are likely to want to make searches (this is not a logical necessity, merely a pragmatic convenience).

It turns out that in Example B no single column is acceptable as the key. But a logically satisfactory key can be formed from a pair of columns (or a triple of columns, or in general, a set of n columns, an "n-tuple"), provided that no pairs (or triples, etc.) of values in the columns are duplicated. Example B contains more than one pair of columns that, viewed as pairs, have no duplicated entries; but a natural choice among the possible pairs is the pair, FNAME with LNAME. It is easy to verify that no two students have the same first and last names, although there are two rows containing "John" in the FNAME column and two rows containing "Ruiz" in the LNAME column. Keys that consist of more than one column are called "composite" keys.

Of course, the possible choice of FNAME and LNAME--though logically acceptable in Example B in its present state--is open to the objection that a larger table would be likely to contain rows with duplicate FNAME-LNAME entries. After all, there are many people in the world with names like "John Jones", "Mary Smith", "Jane Johnson", etc. If we knew that Example B would never be enlarged, then the FNAME-LNAME pair would be satisfactory as a key; but as a practical matter, first name and last name pairs are not acceptable as the key in large tables. This fact of life was one of the major reasons for the adoption of Social Security Numbers in the United States in the 1930s.

Further consideration of the possibilities should suggest to you that a triple formed from the FNAME, LNAME, and BIRTHDATE columns would be likely to work for quite large tables, although even such triples could be duplicated if we were unlucky.

We can note in passing that a table can consist entirely of columns that are key columns. In Example C, the table consists of a single column, which has no duplicate entries. That column is thus the primary key of the table; it is a simple key. This key constitutes the entire table.

Example C

MAJOR
Biology
Psychology
LIS
Theology
History
Management
Philosophy
Comp. Sci.

In Example D the table consists of a pair of columns. Again, there are no duplicate entries in the rows, considered as pairs (or, in this case, whole rows). Thus Example D has a key consisting of the FNAME-LNAME pair. It is a composite key. This key also happens to constitute the entire table.

Example D

FNAME LNAME
William Gates 3d
Henry Aldridge
John Jones
Elmer Gantry
Lynn  Lee
Mary Ruiz
Raquel Ruiz
John Doe
Ada Byron

Example E illustrates another approach to providing a key for tables. The first column contains sequence numbers of the rows. In most database-management systems (DBMSs) such numbers can be easily generated by the DBMS; and they obviously provide, and guarantee, a unique identifier for each row in the table. Thus such sequence numbers can serve as simple keys for their tables. (Note: The sequence-number column can be in any position in the table; it is not restricted to the first column.) Generating sequence numbers in a column in order to provide a guaranteed satisfactory key for the table is a "brute-force" solution to the problem of providing keys. In Microsoft Access the AutoNumber data type is an example of an sequential-number key (if used with the Field Property New Number set to "Increment"; if this is set to "Random", then the numbers are unique, and hence usable as a primary key, even though they are not sequential).

Example E

FNAME LNAME MAJOR G? BIRTHDATE GPA
1 William Gates 3d Biology N 03/11/73
3.00
2 Henry Aldridge Psychology N 06/07/74
1.95
3 John Jones LIS N 11/17/71
3.25
4 Elmer Gantry Theology N 07/04/72
2.25
5 Lynn  Lee LIS Y 10/09/70
3.90
6 Mary Ruiz History Y 05/15/69
2.90
7 Raquel Ruiz Management N 05/15/69
3.90
8 John Doe Philosophy N 09/16/73
1.75
9 Ada Byron Comp. Sci. N 02/29/72
3.95

Problems When Flat Files Contain Too Many Data

Example F illustrates what can happen when we try to put too much information into a single table (a flat-file table). In constructing Example F, I started with just the last three rows from Example A, in order to try to keep down the clumsiness and tediousness of the example. (You should be able easily to imagine what an example with all of the original rows from Example A could look like.) The example is intended to show the complications that arise when we add information, especially multivalued information, to a flat file. You should use the mental image of a data table stored in a spreadsheet, such as Microsoft Excel, to guide you in the following discussion.

Suppose that to the information on students in Example A, we want to add information about their courses in the current semester.

Example F

SSN FNAME LNAME MAJOR G? BIRTHDATE GPA COURSES
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
Economics 101
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
Finite Math. 202
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
U.S. History 301
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
Bus. Admin. 303
777-66-5432 John Doe Philosophy N 09/16/73
1.75
Finite Math. 202
777-66-5432 John Doe Philosophy N 09/16/73
1.75
European Hist. 403
777-66-5432 John Doe Philosophy N 09/16/73
1.75
Logic 105
777-66-5432 John Doe Philosophy N 09/16/73
1.75
Rhetoric 606
543-21-0987 Ada Byron Comp. Sci. N 02/29/72
3.95
Programming 104
543-21-0987 Ada Byron Comp. Sci. N 02/29/72
3.95
Logic 105
543-21-0987 Ada Byron Comp. Sci. N 02/29/72
3.95
U.S. History 301

It is easy to see that this kind of storage is clumsy and contains much redundant information. A natural reaction might be to suggest that there is no need to repeat the information in a following row when it duplicates the information in the previous row; for example,, in rows 2 and 3, we might leave the SSN, FNAME, LNAME, MAJOR, G?, BIRTHDATE, and GPA cells empty. So far as human reading of the table is concerned, this is a reasonable suggestion (and, in fact, good design of output reports for human consumption often provides equivalent "white space" to make the report more attractive and readable for humans). Unfortunately, computer programs are not so good as filling in blanks correctly as humans. Though it would be theoretically possible to write table-processing routines that would supply the missing data, as a practical matter this would be very costly. (The costs would involve no only the extra computer operations but also the extra human time to write such programs in the first place.)

Example G

SSN FNAME LNAME MAJOR G? BIRTHDATE GPA COURSES
888-99-1234 Raquel Ruiz Management N 05/15/69
3.90
Economics 101
Finite Math. 202
U.S. History 301
Bus. Admin. 303
777-66-5432 John Doe Philosophy N 09/16/73
1.75
Finite Math. 202
European Hist. 403
Logic 105
Rhetoric 606
543-21-0987 Ada Byron Comp. Sci. N 02/29/72
3.95
Programming 104
Logic 105
U.S. History 301

Example G is intended to help you to illustrate the problem for yourself. It is Example F with the data in the first seven columns removed from rows following the first entry for each student. Think about how you might write a program that will search the table in Example G in order to produce a report listing the names and other data on each student who is currently taking the course entitled Finite Mathematics 202. A little thinking about this problem should persuade you that it would be easier to program the writing of such a report for Example F than for Example G.

Example F represents the addition of just one additional column to Example A. Suppose that we also wanted to store further information about each student, such as telephone numbers. If we had to expand Example F to include for each student a column for a home phone and a column for a work phone, the table would become still larger and clumsier. And, in a real-life situation, we could easily need to store several more items of information about each student: e.g., college address, permanent-residence address, name of a person to be contacted in an emergency, telephone number of the emergency-contact person, address of the emergency-contact person, the student's faculty advisor, the courses that the student has completed in prior semesters, etc. Clearly, a single file containing such a variety and volume of information would quickly become extremely clumsy.

There are other problems also. Suppose, for example, that the name of U.S. History 301 is to be changed to American History 301. To incorporate this change into Example F, we would have to find every occurrence of the string "U.S. History 301" and change it to "American History 301". As a string-matching and -replacement operation, this would not be too difficult; but it would require some computer time.

A more subtle problem is this: Suppose that the information in Example F is all that we know about the courses offered by the college that these students are attending. Though the college also offers, let us say, a course entitled "Music Appreciation 104", there is no way of knowing this on the basis of the information provided by Example F. But, you may say, surely there is another table somewhere that contains the names of all the courses offered by the college. That may be the case, but even if such a table does exist elsewhere, flat-file systems (e.g., spreadsheets) offer us no easily programmable way of linking the two sets of data. Furthermore, the information in the all-the-courses-in-the-college table will duplicate some of the information in Example F, so that we will have redundant information from which we get no benefit. (This situation stands in contrast to the benefits that can stem, as you will learn, from the minimal redundancy that is designed into relational databases.)

In brief, flat-file systems are fine for small data sets in which only a few items of data are to be stored for each entity handled by the system. When many items of data are to be stored, flat-file systems become clumsy. If some of the data for each entity are multivalued (e.g., more than 1 course per student, or more than 1 work-telephone number for each person), then flat-file systems become very clumsy very quickly.

 


Last revised 2004 Feb 23