Database Keying

Database Keys are incredibly important. They provide a method to uniquely identify a row of data, ensuring that duplicate data doesn’t make it into tables, and providing method to guarantee that you’re referring to the correct row on an Update or Delete statement. Generally, most database engines will automatically index a table based on the Primary key. Further Indices or Unique constraints can be defined based on what you’re trying to accomplish, but to my way of thinking the Primary Key should handle the majority of the Unique constraints for your program.

Even with the Primary Key fields, which can contain as many fields as is necessary to uniquely identify a row, many developers still use numeric “Identity” fields to provide a single field that is created by the server. A lot of people use Identities as a means to uniquely identify rows when joining data together, which I think is just foolish. As an example:

+---------------+       +---------------+
| Products      |       | Orders        |
+---------------+       +---------------+
| ID (Identity) |       | ID (Identity) |
| ProductNumber |       | Customer      |
| Description   |       | ShippingAddr  |
| Price         |       | BillingAddr   |
+---------------+       +---------------+

Now, in the above table, Orders.ID makes sense as a Primary Key. Most companies track orders as a sequence of orders starting from some point (sometimes one, sometimes not) and then keep incrementing that value. Now, if I were designing a real eCommerce site, I wouldn’t link the Orders directly to the Products, like I’m going to do here, I’d like Orders to a Cart, which aids in occasionally deleting old, outdated carts. For the sake of simplicity, we’re going to ignore the cart system.

The question immediately becomes how do you associate the product with the order? In a relational database, there isn’t any means t directly display a one-to-many relationship, so we’re forced to create an intermediate table to render that link. This table can also include data that might change from the product listing from time to time, like the unit price, or any other data that is necessary in the context of an order. The OrderProduct can be linked to the Products based on either the ID or the ProductNumber (the ProductNumber usually being meant for human consumption), and the Order based on the Order Id.

+--------------------+       +-------------------+      +---------------+
| Products            |       | OrderProducts |      | Orders        |
+--------------------+       +-------------------+      +---------------+
| ID (Identity)       |<+    | ID (Identity)     |   +>| ID (Identity) |
| ProductNumber |  +->| ProductId        |    |   | Customer      |
| Description        |        | OrderId          |<-+  | ShippingAddr  |
| Price                 |         | Quantity         |      | BillingAddr   |
+--------------------+        | UnitPrice         |      +---------------+
                                     +------------------+

The question then, is whether or not to use the ProductId or the ProductNumber to key against the products. The answer is in two parts. First, the ProductNumber will need to be Unique, since multiple products with the same number would be difficult to manage. I would argue that the best means to make it unique would be to the make it the Primary Key, at which point an index is automatically created for it. Since the field is then indexed, I would make that item the Foreign Key, requiring any joins to join based on that field.

Unfortunately, this raises a question of mutability. Should product numbers be allowed to change? The obvious answer is “No, how would you be able to verify that your data is correct?” However, time and again, I’ve seen systems that would happily allow Product Numbers to be changed at any time. The better of these systems ensured that they were keying off of the Identity field, whose uniqueness is not guaranteed by the database.

What if your primary key is not a single field? For instance, in the Course Catalog data at work, the primary key is the Course Prefix, Course Number, as well as the Year/Term the course began. A total of four fields, which allows us to “span” the course data in order to reflect how a course’s information has changed over the years. A new project was to link the display of a course to one or more campuses, as Washington State University maintains it’s main campus and three ‘urban’ campuses. To do this, initially I was going to simply use the Course Identity field to link the course to an Academic Unit. Quickly, I realized that this caused the data to be invalidated the moment a user created a new ‘span’ of a course. Ultimately, we don’t care about when a course was offered at a specific campus, we offer PDF copies of old catalogs for students to peruse, the database is being used to drive the current data only. If I were to be planning for students to choose to view a Units offerings for any term, then I would have to store the span data in the joining table, luckily we have historic scheduling data which can offer a similar experience (at least back to 2000) for students to determine what courses were offered which terms.

Since users did not want to have to rebuild those associations every time a course spanned, I choose to switch to using the partial key consisting of Prefix and Course Number to link into the courses data. This allows me to easily say that “this course is available through these units”, and still keep things open to whether we want to see the course as it is today, or as it was ten years ago. The added benefit is simple, the database is more human readable.

I think a lot of developers forget that ultimately the data needs to be consumed by people. Sure, it takes longer to compare a two part key consisting of text and numberics than it does a single numeric field, but as Moore’s law has continued, that time really is becoming negligible, and that single field numeric doesn’t mean anything to anyone perusing the data. I’m sure many people will argue that we can simply write software to remodel the data to be more easily read by humans, but we can’t forget that, as developers, we too are consumers of that data. The harder it is for use to identify how the data is to be consumed, the harder it will be for us to maintain that software in the future.

If we recognize that data is to be consumed by humans, I believe that the question of mutability becomes easier to answer as well. If we identify a product as a BIB-2121, it is difficult to change our thinking to identify that product by a different name. I think this is where my background in accounting comes in handy, as Accountants view that BIB-2121 as something different than other people in the organization. Most people learn to associate a BIB-2121 with what that product is and looks like, maybe Bob’s Interboundary Batter, or whatever. For them, BIB-2121 is simply an alias, and changing that alias is not a difficult procedure. In the accounting world, however, that product is unique and special. That it is a BIB-2121 means something. It’s on all of the Invoices, Purchase Orders, etc. Every document that the company has printed in relation to the Bob’s Interboundary Batter references that number. The database may have assigned some random number internally to that item, let’s say 21411, but as far as the customer’s concerned, it’s a BIB-2121.

This is the real problem with Mutable identifiers. If an Identifier changes, then all of the sudden there is a lot of incorrect data out there for the company. If you print out a new invoice or statement for a customer, suddenly it appears that they’ve bought something other than what you sold them. For Accountants, it’s important to maintain that history, you want to be positive that the invoice copy you print today will be the same as an invoice copy you print a year from now (except that the one in the future will hopefully have been marked paid). If your paperwork, and your customer’s paperwork were to differ, there could be big trouble.

Those identifiers can change, but for an accountant, you’re simply creating a new item, in a new item number, and transferring any old stock from the old item into the new item. This also prevents you from recycling old Item numbers (accidentally, at least), which again can cause confusion. The more precisely you can model your data, the better. This holds especially true for links. I may not know what Identifier 21211 is, but odds are I can remember that BIB-2121 is Bob’s Interboundary Batter. If I’m ever digging through the data, which in this day and age of Data Warehousing is more and more common, I want to have to work as little as possible to analyze the data. And you can bet that my boss wants to work even less than I’m willing to.

Multipart keys may seem inefficient. They may seem unnecessary to the average developer, but remember, you’re writing software for people, not machines. Make the machine work harder. That’s what it’s for. You may only intend for the user to access your data through your interfaces, but some user won’t want to do that, and the way your data is structured says just as much about you as a developer as any code you may write.

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.” — Fred Brooks.