March 2008 Archives

Database Keying

Database Keys are incredibly important. They provide a method to uniquely identify a row of data, ensuring that duplicate data doesn’t make it into tables, and providing a method to guarantee that you’re referring to the correct row on an Update or Delete statement. Generally, most database engines will automatically index a table based on the Primary Key. Further Indices or Unique constraints can be defined based on what you’re trying to accomplish, but to my way of thinking the Primary Key should handle the majority of the Unique constraints for your program.

Even with the Primary Key fields, which can contain as many fields as are necessary to uniquely identify a row, many developers still use numeric “Identity” fields to provide a single field that is created by the server. A lot of people use Identities as a means to uniquely identify rows when joining data together, which I think is just foolish. As an example:

+---------------+       +---------------+
| Products      |       | Orders        |
+---------------+       +---------------+
| ID (Identity) |       | ID (Identity) |
| ProductNumber |       | Customer      |
| Description   |       | ShippingAddr  |
| Price         |       | BillingAddr   |
+---------------+       +---------------+

Now, in the above tables, Orders.ID makes sense as a Primary Key. Most companies track orders as a sequence starting from some point (sometimes one, sometimes not) and then keep incrementing that value. Now, if I were designing a real eCommerce site, I wouldn’t link the Orders directly to the Products, like I’m going to do here; I’d link Orders to a Cart, which aids in occasionally deleting old, outdated carts. For the sake of simplicity, we’re going to ignore the cart system.

The question immediately becomes how do you associate the product with the order? In a relational database, there isn’t any means to directly represent a many-to-many relationship (an order can contain many products, and a product can appear on many orders), so we’re forced to create an intermediate table to render that link. This table can also include data that might change from the product listing from time to time, like the unit price, or any other data that is necessary in the context of an order. The OrderProducts table can be linked to the Products based on either the ID or the ProductNumber (the ProductNumber usually being meant for human consumption), and to the Order based on the Order ID.

+---------------+       +---------------+       +---------------+
| Products      |       | OrderProducts |       | Orders        |
+---------------+       +---------------+       +---------------+
| ID (Identity) |<-+    | ID (Identity) |    +->| ID (Identity) |
| ProductNumber |  +----| ProductId     |    |  | Customer      |
| Description   |       | OrderId       |----+  | ShippingAddr  |
| Price         |       | Quantity      |       | BillingAddr   |
+---------------+       | UnitPrice     |       +---------------+
                        +---------------+

The question, then, is whether to use the ProductId or the ProductNumber to key against the products. The answer is in two parts. First, the ProductNumber will need to be Unique, since multiple products with the same number would be difficult to manage. I would argue that the best means to make it unique would be to make it the Primary Key, at which point an index is automatically created for it. Since the field is then indexed, I would make that item the Foreign Key, requiring any joins to join based on that field.
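As a rough illustration of that arrangement, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names follow the diagram above, trimmed to a few columns; the sample rows, the price, and the helper function `rejected` are assumptions made up for the example, not part of any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE Products (
        ProductNumber TEXT PRIMARY KEY,  -- natural key; the engine indexes it for us
        Description   TEXT NOT NULL,
        Price         REAL NOT NULL
    );
    CREATE TABLE Orders (
        ID       INTEGER PRIMARY KEY,    -- sequential order number
        Customer TEXT NOT NULL
    );
    CREATE TABLE OrderProducts (
        OrderId       INTEGER NOT NULL REFERENCES Orders(ID),
        ProductNumber TEXT    NOT NULL REFERENCES Products(ProductNumber),
        Quantity      INTEGER NOT NULL,
        UnitPrice     REAL    NOT NULL,  -- price at the time of the order
        PRIMARY KEY (OrderId, ProductNumber)
    );
""")

# Invented sample data.
conn.execute("INSERT INTO Products VALUES (?, ?, ?)",
             ("BIB-2121", "Bob's Interboundary Batter", 9.99))
conn.execute("INSERT INTO Orders VALUES (?, ?)", (1, "Example Customer"))
conn.execute("INSERT INTO OrderProducts VALUES (?, ?, ?, ?)",
             (1, "BIB-2121", 3, 9.99))


def rejected(sql, params):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


# A second product with the same number violates the primary key.
duplicate_rejected = rejected("INSERT INTO Products VALUES (?, ?, ?)",
                              ("BIB-2121", "Impostor", 1.00))
# An order line for a product that does not exist violates the foreign key.
orphan_rejected = rejected("INSERT INTO OrderProducts VALUES (?, ?, ?, ?)",
                           (1, "NOPE-0000", 1, 1.00))
# The joining table is readable on its own, without a join back to Products.
order_lines = conn.execute(
    "SELECT ProductNumber, Quantity FROM OrderProducts WHERE OrderId = 1"
).fetchall()
```

The point of the sketch is the last query: because the joining table carries the product number itself, a person reading it can tell what was ordered without looking anything up.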

Unfortunately, this raises a question of mutability. Should product numbers be allowed to change? The obvious answer is “No, how would you be able to verify that your data is correct?” However, time and again, I’ve seen systems that would happily allow Product Numbers to be changed at any time. The better of these systems ensured that they were keying off of the Identity field, whose uniqueness is not guaranteed by the database.

What if your primary key is not a single field? For instance, in the Course Catalog data at work, the primary key is the Course Prefix, Course Number, as well as the Year/Term the course began: a total of four fields, which allows us to “span” the course data in order to reflect how a course’s information has changed over the years. A new project was to link the display of a course to one or more campuses, as Washington State University maintains its main campus and three ‘urban’ campuses. To do this, I was initially going to simply use the Course Identity field to link the course to an Academic Unit. Quickly, I realized that this caused the data to be invalidated the moment a user created a new ‘span’ of a course. Ultimately, we don’t care about when a course was offered at a specific campus; we offer PDF copies of old catalogs for students to peruse, and the database is being used to drive the current data only. If I were planning for students to view a Unit’s offerings for any term, then I would have to store the span data in the joining table. Luckily, we have historic scheduling data which can offer a similar experience (at least back to 2000) for students to determine what courses were offered in which terms.

Since users did not want to have to rebuild those associations every time a course spanned, I chose to switch to using the partial key consisting of Prefix and Course Number to link into the courses data. This allows me to easily say that “this course is available through these units”, and still keep things open to whether we want to see the course as it is today, or as it was ten years ago. The added benefit is simple: the database is more human readable.

I think a lot of developers forget that ultimately the data needs to be consumed by people. Sure, it takes longer to compare a two-part key consisting of text and numerics than it does a single numeric field, but as Moore’s law has continued, that time really is becoming negligible, and that single-field numeric doesn’t mean anything to anyone perusing the data. I’m sure many people will argue that we can simply write software to remodel the data to be more easily read by humans, but we can’t forget that, as developers, we too are consumers of that data. The harder it is for us to identify how the data is to be consumed, the harder it will be for us to maintain that software in the future.
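To make the partial-key link concrete, here is a small sketch in Python with sqlite3. Everything in it is hypothetical: the table names, the column names, the course, and the unit names are invented for illustration and are not the actual catalog schema. Note that a partial key cannot be declared as a true foreign key, since SQL requires the referenced columns to be unique in the parent table, so in this sketch the link is maintained by convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses (
        Prefix    TEXT    NOT NULL,
        Number    INTEGER NOT NULL,
        StartYear INTEGER NOT NULL,
        StartTerm TEXT    NOT NULL,
        Title     TEXT    NOT NULL,
        PRIMARY KEY (Prefix, Number, StartYear, StartTerm)  -- the four-part key
    );
    CREATE TABLE CourseUnits (
        Prefix TEXT    NOT NULL,
        Number INTEGER NOT NULL,
        Unit   TEXT    NOT NULL,
        PRIMARY KEY (Prefix, Number, Unit)  -- links on the partial key only
    );
""")

# Two spans of the same (invented) course.
conn.executemany("INSERT INTO Courses VALUES (?, ?, ?, ?, ?)", [
    ("CS", 101, 2000, "Fall", "Intro to Computing"),
    ("CS", 101, 2008, "Spring", "Intro to Computer Science"),
])
# The unit associations are entered once and apply to every span.
conn.executemany("INSERT INTO CourseUnits VALUES (?, ?, ?)", [
    ("CS", 101, "Main Campus"),
    ("CS", 101, "Urban Campus A"),
])


def units_for(prefix, number):
    """List the units a course is available through, regardless of span."""
    rows = conn.execute(
        "SELECT Unit FROM CourseUnits WHERE Prefix = ? AND Number = ? ORDER BY Unit",
        (prefix, number),
    )
    return [unit for (unit,) in rows]


def spans_with_units(prefix, number):
    """Join every span of a course to its units via the partial key."""
    return conn.execute("""
        SELECT c.StartYear, c.StartTerm, u.Unit
        FROM Courses c
        JOIN CourseUnits u ON u.Prefix = c.Prefix AND u.Number = c.Number
        WHERE c.Prefix = ? AND c.Number = ?
        ORDER BY c.StartYear, u.Unit
    """, (prefix, number)).fetchall()
```

Adding the 2008 span required no change to CourseUnits; both spans pick up both units through the join.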

If we recognize that data is to be consumed by humans, I believe that the question of mutability becomes easier to answer as well. If we identify a product as a BIB-2121, it is difficult to change our thinking to identify that product by a different name. I think this is where my background in accounting comes in handy, as Accountants view that BIB-2121 as something different than other people in the organization. Most people learn to associate a BIB-2121 with what that product is and looks like, maybe Bob’s Interboundary Batter, or whatever. For them, BIB-2121 is simply an alias, and changing that alias is not a difficult procedure. In the accounting world, however, that product is unique and special. That it is a BIB-2121 means something. It’s on all of the Invoices, Purchase Orders, etc. Every document that the company has printed in relation to the Bob’s Interboundary Batter references that number. The database may have assigned some random number internally to that item, let’s say 21411, but as far as the customer’s concerned, it’s a BIB-2121.

This is the real problem with Mutable identifiers. If an Identifier changes, then all of a sudden there is a lot of incorrect data out there for the company. If you print out a new invoice or statement for a customer, suddenly it appears that they’ve bought something other than what you sold them. For Accountants, it’s important to maintain that history; you want to be positive that the invoice copy you print today will be the same as an invoice copy you print a year from now (except that the one in the future will hopefully have been marked paid). If your paperwork and your customer’s paperwork were to differ, there could be big trouble.

Those identifiers can change, but for an accountant, you’re simply creating a new item, with a new item number, and transferring any old stock from the old item into the new item. This also prevents you from recycling old Item numbers (accidentally, at least), which again can cause confusion. The more precisely you can model your data, the better. This holds especially true for links. I may not know what Identifier 21411 is, but odds are I can remember that BIB-2121 is Bob’s Interboundary Batter. If I’m ever digging through the data, which in this day and age of Data Warehousing is more and more common, I want to have to work as little as possible to analyze the data. And you can bet that my boss wants to work even less than I’m willing to.

Multipart keys may seem inefficient. They may seem unnecessary to the average developer, but remember, you’re writing software for people, not machines. Make the machine work harder. That’s what it’s for. You may only intend for the user to access your data through your interfaces, but some user won’t want to do that, and the way your data is structured says just as much about you as a developer as any code you may write.

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.” — Fred Brooks.

Microsoft Underestimates the Value of Open Source

I was listening to This Week on Channel 9, where they spoke about Microsoft releasing the Source code to the ASP.NET MVC Framework preview. The license is a custom license which, unlike the license that the rest of the .NET Class Libraries are available under, allows developers to patch and modify the source. What they don’t grant, and what is incredibly important, is the right to redistribute. At least the Channel 9 guys were honest about this.

The right to redistribute is incredibly important. If I fix a bug, or add a feature, or anything else, to Microsoft’s code-base, I am completely unable to share those fixes with anyone else. I could try to get Microsoft to put the fix in the ‘official’ codebase so that the rest of the community can benefit from my code. ScottGu says that this is to prevent multiple versions of the framework from floating around.

While I may understand Microsoft’s feeling on this point, it is very rare that a true fork of an Open Source project is created, since it doesn’t make sense not to work together as a community. Even when someone gets their feelings hurt and they fork, it’s rare that their fork survives, unless it truly is the superior codebase. Still, it is ultimately better for everyone if that ability to go out on one’s own is preserved. Since I can’t redistribute my changes, will Microsoft support me if I break the framework for my own uses? Doubtful.

I suspect the real plan of Microsoft is that once the MVC framework is finalized and officially released, the ability to patch will be revoked, and the code will be available under the Reference license, just like the rest of the framework. Ultimately, this means that I’ve made a point of not downloading or installing the source, as I don’t want to risk having seen it, which may make it impossible for me to contribute to Mono. Luckily, the MVC framework uses no P/Invokes, and only has 23 issues reported by MoMA, meaning that we should be able to use the Binaries via Mono without too much difficulty. I’m thinking about taking care of those missing pieces, maybe eventually reimplementing MVC on Mono as a true Open Source library. It’s a great framework, but I’d just as soon have Microsoft keep the code internal as perform such a limited source release.

Who Loses? Potentially Everyone

Patrick Durusau, who edits the Open Document Format standard, posted an open letter today asking Who Loses if [Office] Open XML is denied status as an ISO Standard, the vote for which is in process right now.

His arguments for why OOXML is necessary for ODF are interesting: currently ODF doesn’t support Formulae within the standard (soon to be rectified), ODF doesn’t have ISO-approved extensions for old MS Office legacy features (so what?), and ODF doesn’t have a map of the current Microsoft Office formats (why should it?). It appears to me that Durusau is trying to add a lot of MS Office-isms to ODF, and I’m not quite sure why.

Both formats try to solve a similar problem, so there is bound to be a lot of cross-pollination in ideas. However, just because ODF doesn’t support everything that MS Office does at this time doesn’t mean that ODF needs MS OOXML. When Microsoft displaced WordPerfect and Lotus as the premier Office applications, features were lost, and compatibility between formats was not maintained (Excel will import Lotus 1-2-3 files, but I don’t believe it can save them). Hell, I know several accountants that still lament the death of Lotus 1-2-3.

If ODF is lacking support for features that people need, it is the responsibility of the ODF people to add the extensions necessary for those features. Perhaps we should look at the features that MS Office provides, to try to ‘plan ahead’ for features before they’re supported in clients, since we don’t want to force clients to use non-standard ODF extensions, but is OOXML necessary as a standard for that to happen? I don’t think so.

Perhaps, as Miguel de Icaza believes, we’re wasting time focusing on OOXML. Now, while I disagree with Miguel on OOXML’s viability as an ISO standard, I am willing to acknowledge that the community may be a bit too focused on it at this time. However, there are more reasons to be against OOXML than for it. It’s a redundant standard, as ODF already fills that niche. It’s an incomplete standard, as Office 2007 still has an enormous number of extensions to the proposed standard. It’s not a true XML Format, as large parts of the old Binary formats have simply been wrapped into XML blocks. ODF, while missing some features, is a better format. A better standard. We only need one ISO standard for XML-based office formats, and that standard should be the best one. If Microsoft would work with ODF to get the features that they require into the ODF standard (assuming those features are worth standardizing), then we wouldn’t be in the mess that we’re in.

I doubt anyone, except for maybe Roy Schestowitz, wants this unrest to continue. More so, we’re tired of getting bad data formats forced upon us, and we certainly don’t want bad data formats forced through ISO. If ODF needs OOXML, it’s needed as an ally, as a thing to be integrated, not as a competing standard.

PlasticSCM 2.0 Released

Yesterday, Codice Software released the first official build of PlasticSCM 2.0. PlasticSCM is a Software Configuration Management (or Source Control) system written entirely in .NET, which gains cross-platform compatibility through the Mono Project. To date, the command line client will run on any platform where Mono is supported, and the GUI is available on Windows, Linux, and Mac OS X.

I’ve written before about Plastic and SCM in general, and I’ve been using Plastic on a trial basis for several months at my current job, including testing all the interim builds of the software. PlasticSCM is a great solution for parallel development, and with some of the new features in 2.0, for distributed teams as well. I really, really enjoyed using Plastic, as I prefer a parallel development model, where I create a large number of branches, keeping the mainline clean and stable. The only problem I had with Plastic’s model was that it seemed impossible to mark a branch as ‘inactive’ so that I could hide it once I was not doing any active development within that branch.

Ultimately, our office decided against Plastic. A co-worker was an old-school Perforce user, and I had no choice but to admit that today Perforce is a more mature product. It has a published API. It has a triggers and event system which we can leverage for things like keeping our integration server always up to date, or any other continuous integration we might do. I’ve had many conversations via e-mail with the Codice team, and I appreciate their accessibility immensely, and all these things are planned features, but they don’t exist today.

In short, we chose Perforce because it was the best product available today. I personally believe that in six months, or a year, maybe two at the most, Plastic will have those features that I wanted today, and then some. Plastic will continue to be my go-to product for development teams that don’t yet have Source Control. Codice provides excellent support, they’re available, and they listen to customer feedback (and I hadn’t even given them a dime, I’m really sorry for that). Plastic doesn’t pretend to be a complete solution like Microsoft Team System or Borland’s StarTeam, both of which provide heavy issue/defect/feature tracking, as well as integrated project management tools. Plastic exposes data for Project Managers, but doesn’t provide things like Gantt Charts and time-tracking built in.

One thing I’m going to really miss about Plastic as I move toward Perforce at my current job is its clean representation of the current state of the source tree. On Perforce, each branch is represented as a new directory under the main depot, and each branch is automatically populated with all the files you’ve marked for the branch, which can include a very large amount of source. In Plastic, you denote the ‘parent’ branch of the new branch, and optionally a Label to branch from. The new branch is completely empty itself, but it inherits the full state from the parent branch. This is why choosing a label is so important: without the label, the state of parts of the tree can change underneath you, whereas labels keep child branches stable. Items are branched on the individual level only when you choose to commit changes within that branch. It’s a clean mechanism to keep the tree clean, and it also helps ensure that you know with full certainty which branch you’re checking out from and committing against.

Perforce is a very good product, but I don’t see Perforce changing too much in the future. Plastic is good, but it has a fast, dedicated, responsive development team that I believe will be able to create a great product. I wish them the best, and I’ll strive to send some customers their way when I can.
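The inheritance behaviour described above can be pictured with a toy model. This is only a sketch of the idea in Python, under the assumption that a label behaves like a frozen snapshot of the parent; the `Branch` class and its methods are invented for illustration and say nothing about how Plastic is actually implemented.

```python
class Branch:
    """Toy branch: starts empty and inherits everything else from its parent."""

    def __init__(self, name, parent=None, label=None):
        self.name = name
        self.parent = parent
        self.label = label    # optional frozen snapshot {path: content}
        self.own_items = {}   # only the items committed in this branch

    def commit(self, path, content):
        """Branch a single item by recording a change to it here."""
        self.own_items[path] = content

    def visible(self):
        """Return everything this branch can see as a new dict."""
        if self.label is not None:
            state = dict(self.label)        # pinned to the labelled state
        elif self.parent is not None:
            state = self.parent.visible()   # floats with the parent
        else:
            state = {}
        state.update(self.own_items)        # own changes win
        return state

    def read(self, path):
        return self.visible()[path]


main = Branch("main")
main.commit("app.py", "v1")
release_label = main.visible()   # take a label of main as it is right now

pinned = Branch("task001", parent=main, label=release_label)
floating = Branch("task002", parent=main)

main.commit("app.py", "v2")      # main moves on after the branches exist
```

After the last line, `pinned.read("app.py")` still sees "v1" while `floating.read("app.py")` sees "v2", which is the instability the label is there to prevent. Neither child holds any items of its own until something is committed in it.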

Electronic Voting, and Why I Dislike It

There has been a lot of talk lately of moving toward Electronic Voting, where we’ll use computer displays to select our candidates and have it tallied by computer. In theory, the idea sounds great. However, rather than have an open development process of such systems, the state governments who’ve gone this route have turned the software (and hardware!) over to companies who refuse to reveal any details of how their devices work.

This will never be okay. How are we, as voters, supposed to feel that our votes are counted correctly when the process is a black box we have no hope to see within? As Ed Felten revealed, how are we to trust these machines, when they can’t even agree on their own counts? And, when asked for explanation and independent analysis, the companies make threats?

Software has bugs. All processes do, really. Things that don’t work as they’re intended. In systems designed by humans, these bugs can be caused by ignorance, sloppiness, even graft. The voting process is one where we need to be particularly careful about graft, as the voting process is supposed to ensure that we get the candidate who will do what we, the people, want. Dan Wallach, another Freedom to Tinker blogger, argues that perhaps it’s time we eliminate the privacy of individual votes. I’m not sure how I feel about that idea, as I’m sure there are still people out there who could get in trouble for their votes, and making laws against prejudicial action based on voting history isn’t necessarily going to help. However, such a system would at least expose the process to public scrutiny and verification.

And the companies are treating these problems like any other software problem. Voting machines need to be engineered like the Space Shuttle, fault tolerant and tested to the point of being unbreakable. They should be developed in the open, with the code available for perusal and testing. I should be able to use the same system for my club elections as the government uses for the presidential elections, if for no other reason than to verify that the system works.

The solution is open voting. The technology can’t be hidden. The paper trail must be visible. Nearly all government processes should be transparent to the people for the system to function. Voting is one of the most important of these processes and we must ensure that it is kept fair first, or else the entire damn system simply doesn’t matter anymore.


University fee schedules can be ridiculous. Here at Washington State University, a recent request to raise the student fee for the Pullman Transit system was voted down by the Graduate and Professional Student Association (GPSA). Some people are upset that a minority of the student body was able to override the Undergrad population, but the grad students feel that there are simply too many fees, and that the proposal did nothing to address the needs of the Graduate Student population.

Right now, there are three “Express” buses that run most of the day that have a 15-minute cycle, but all run on the same roads. No buses go by the peripheral parking lots. No buses go by the Campus Graduate Housing. I understand that the Bus system feels that the students (and since the Undergrads are most of the students, the Undergrads) are its bread and butter. Part of the problem is that the buses have no mechanism to verify that all their passengers have paid their bus fees. Anyone with a CougarCard can simply flash that at the bus driver and get on the bus, no questions asked. The bus drivers don’t even really look at the photographs on the cards. No effort is made to ensure that the person has paid the bus fee.

They’d asked for the fee increase because ridership is up. However, I wonder how many people are riding the bus without paying for it? And how many people are riding without paying, not understanding that they are supposed to pay? Pullman Transit is a great service to have available (hell, it’s better than Spokane’s Bus system), but they need a better mechanism to track usage and charge for usage, if they plan to be sustainable. They need to offer bus service to the periphery of campus, so that Staff and Graduate students feel better about paying that bus fee. Public Transit is important, but people must feel like they’re getting their money’s worth, or they won’t use it. Buses need to go where you’re going, and they need to do so in a timely manner. Pullman Transit is pretty good, but the improvements they wanted were not the ones that the Graduate Students needed, so kudos to the GPSA.

The next set of fee increases proposed have some elements that are really quite funny. The Student Rec Center currently charges all students a mandatory $128.00 per term. This fee is assessed to all students whether they utilize the service or not. The proposal reported on today would raise that fee $8.50 a term per student, which works out to a nearly $150,000 increase in their operating budget per term. This is a state of the art facility, and $136.50 for four or five months of use by the students is a really fantastic deal for students, especially for those who are being subsidized by those students who don’t use the facility. By contrast, Faculty or Staff who buy SRC passes will pay $170.00 per term for the privilege.

The best part is the reason that they’re requesting the fee increase:

The reasoning behind the proposed 6.64 percent fee increase is that the Rec Center continues to see increases in the number of students who use the facility as well greater overall frequency of that use, according to documents presented at the meeting.

So, since people are actually USING the facility, they need to charge everyone (including those students who aren’t using it) more? Great. Certainly, they need to charge for their service, but perhaps the SRC should be an opt-in fee. I doubt they’ll go that route, as it eliminates a guaranteed revenue stream that they’ve been happily exploiting for years. At my alma mater we had a gym fee, and an athletic fee that granted us free tickets to sporting events, regardless of if we ever intended to use either (a point of great contention among many of my friends).

Universities have a tendency to nickel-and-dime the student body with a plethora of obscure fees, many for services that students aren’t even aware of. I feel that the WSU Student Recreation Center is a completely worthwhile service, but if they need to raise the rates for all students because more of the student body is utilizing a service they are required to pay for, I believe they need to revisit their funding scheme. The cost of operation may have gone up, but the reason cited suggests that they were banking on the fact that the majority of students would pay for, but never use, their service, which is just wrong.

Hackers & Painters: Big Ideas from the Computer Age

Paul Graham has an interesting story. After earning a Ph.D. from Harvard, Mr. Graham went to Europe to study Painting. After a few years of this, he returned to the States to start a new company in 1995 called Viaweb, the first web-based application service provider, whose product was purchased by Yahoo! in 1998 and re-branded as Yahoo! Store.

Since then, Mr. Graham has taken to writing essays on his website, which O’Reilly helped compile in 2004 into the book Hackers & Painters. The essays which make up the book cover Graham’s experiences, ranging from being a Nerd in High School (and what that means), to his thoughts on programming languages, and where they’re heading.

The essays are well written, using language that even lay-people should be able to follow. He doesn’t pull punches on the technical details, including quite a few code-samples in his discussion of Programming Languages and where they’re headed. He argues that all Languages are slowly converging on LISP.

While I am far from fluent in any LISP dialect, I know just enough to be able to read the code and appreciate what he means. LISP was a technology that was decades ahead of its time. You could do interesting things with LISP back in the ’60s, but the execution times were simply unacceptable for most business practices. The first programming language that I was truly fluent in was C, and now I wonder if that perhaps wasn’t a disadvantage, as I see how languages have become more and more dynamic in the last decade. However, my favorite languages are more dynamic, like Perl and Python, which provide me with the power of the REPL, but a syntax that is more comfortable than that of LISP. I really do need to study LISP further, however.

The book isn’t really about the advantages of LISP, and the fact that languages appear to be implementing more and more LISP-isms every day. While that discussion was interesting, Mr. Graham offered interesting insight on the social and psychological effects of intelligence, and on how the Computer Revolution has changed the nature of business. He opens with the essays on sociology, in which he is highly critical of the modern education system, likening it to the prison system in this country. To a certain degree, I understand what he’s referring to. Young people are not particularly valued in this country, relegated to menial tasks and a daily situation where we’d been forced to create our own society which was anything but a meritocracy (unless you count the merit of athletic ability). His observation that perhaps the reason we require kids to read Lord of the Flies was to try to show students that the society we’d created was dangerous was, if true, definitely lost on me and my contemporaries, but I sometimes wonder if my High School experience was less negative than Graham’s.

More interesting was the insight that Graham was uniquely able to offer on the link between Hackers and Painters: how great Hackers tend to resemble, at least mentally, the Great Painters of the past. I’m apt to agree with that, but I think that it goes beyond Hackers and Painters to the more general class of people that are now being called Makers. Makers are people who simply wish to create, whatever their medium is. The need to create is within everyone, but for some people it’s such a driving force that it appears to be the dominant aspect of their personalities. And Makers are always the most critical of their own work. If Michelangelo were alive today, he could no doubt point out every last flaw in his work on the Sistine Chapel, just as a master carpenter can see flaws in every piece he’s ever constructed, flaws which no others can see.

If you’ve ever thought about starting your own company, you owe it to yourself to buy this book. Graham acknowledges that he was one of the lucky ones with Viaweb. Still, he acknowledges that Viaweb succeeded because they did things differently than so many companies that came after them. They did, where others promised. They had little in the way of Venture Capital money, which they didn’t need because of their release-early-and-release-often design philosophy. They would sometimes push out half a dozen features and dozens of bugfixes in a single day, simply because they depended upon their swiftness to market and the feedback they’d receive from their users.

While Graham’s business experience is particularly well suited to the web, anyone looking to start a business should consider his lessons. He’s absolutely right that small companies have a huge advantage due to their lack of overhead. Tech Support guys could always walk down the hall to talk to the programmers when a customer was on the phone. He’s honest: small companies win because they move faster than big companies, and big companies win because they can outspend (or simply buy) small companies. Startups are risky, and most fail with nothing to show for the effort, but those that win, win big.

Even if you don’t agree with everything Paul Graham says, which I don’t think you should, his essays are still well thought out and written, and his points are worth reading. The book is fantastic, and a must read for anyone who’s even entertained the thought of founding a high-tech company in this day and age.

On the Internet, Not All Arguments are Strings

The Internet is not strongly-typed. It never has been. On our forms, we take in an enormous variety of data: strings, dates, times, numbers, even files. However, due to the nature of HTTP, that data transfers from client to server as ASCII text, which the server will most often happily convert into Strings. Regretablly, this leads to any number of problems when dealing with that data back on the server.

I’ve long held the view that any data that comes from the user should be treated as suspect. HTTP is a stateless protocol, so we have to trust some data to the User, but we must strive to minimize what they user is able to manage, with proper access controls. The Session objects support by most modern web platforms are a help in this, as they allow us keep the real data on the server, under our control, while having a mechanism to identify who the users are. Changing session variables is typically so inconvenient, that most users wouldn’t even dream of doing so. However, I have encountered web applications where the cookies being used to maintain session values were easily identifiable, and easily guessed. I know of one e-commerce site where a cookie with the shopping cart id is sent to the user, and the user can change that cookie to view any historical or current shopping cart. This same site, until fairly recently, controlled applciation access to the administrative interface based on cookies that were sent to the user. Therefore, any user who had logged in could choose to change their access level. Actually, if you knew the name of the keys set, you could bypass the entire login process.
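As a minimal sketch of the safer arrangement, assuming a plain in-memory dictionary stands in for whatever session facility the web platform provides: the cookie carries only a random, meaningless token, and the access level lives on the server. The function names and the store itself are hypothetical.

```python
import secrets

# Hypothetical server-side session store: token -> session data.
_sessions = {}


def create_session(user_id, access_level):
    """Start a session and return the token to send to the user as a cookie."""
    token = secrets.token_urlsafe(32)  # random and unguessable; encodes nothing
    _sessions[token] = {"user_id": user_id, "access_level": access_level}
    return token


def access_level_for(cookie_value):
    """Look up the access level on the server; the cookie never states it."""
    session = _sessions.get(cookie_value)
    if session is None:
        return None  # unknown or forged token: treat as not logged in
    return session["access_level"]


customer_token = create_session(user_id=42, access_level="customer")
```

With this shape, editing the cookie gets the user nowhere: any value other than a token the server issued simply fails the lookup, and there is no "access level" key on the client to change.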

Unfortunately, it seems that many web developers implicitly trust the data coming across the wire. Consider the shopping sites that took the price-per-unit from a hidden field on a page. The trick to web development is that you should only provide the user with the minimal amount of data that you need to identify their intent on postbacks. Anything else needs to come from an authoritative source, which never includes the user. I suspect the reason for some of these design decisions is to reduce memory usage (keeping sessions small) and database accesses (which take time). Unfortunately, more input from the user’s machine provides more potential attack vectors, more data over the wire, and more opportunities for errors to occur. Plus, it requires more work to validate input from the user than it does to pull that data from a reliable source. The few milliseconds that might be saved by serving up input to the user and pulling it back in simply don’t justify the security holes.
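To make the hidden-field problem concrete, here is a small Python sketch. The `CATALOG` dictionary, the field names and the `order_total` function are all made up for illustration, with the dictionary standing in for a database lookup. The form may well contain a `unit_price`, but the server never reads it.

```python
from decimal import Decimal

# Hypothetical server-side catalog standing in for a database lookup.
CATALOG = {
    "SKU-1001": Decimal("19.99"),
    "SKU-2002": Decimal("5.00"),
}

def order_total(posted_form):
    """Price a line item using only the product id and quantity from the form."""
    product_id = posted_form["product_id"]
    quantity = int(posted_form["quantity"])
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    # The price always comes from the server's own data, never from the
    # posted form. Unknown product ids raise KeyError.
    unit_price = CATALOG[product_id]
    return unit_price * quantity

# A tampered post: the user rewrote the hidden price field.
tampered = {"product_id": "SKU-1001", "quantity": "2", "unit_price": "0.01"}
print(order_total(tampered))  # 39.98
```

The user still tells us what they want (which product, how many), but everything that costs money is looked up on our side.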

Stepping away from the admittedly horrible software that I’ve been using as examples, the fact that all data is transferred between client and server as ASCII has some interesting trade-offs. Since everything is a string in HTTP, we are required to do a lot of parsing at the server side, which leads to its own challenges. If HTML Forms were strongly typed (something that is closer to being possible in HTML 5), then these Web Frameworks that we work in today could be extended to save us from dealing with many issues that we face today. I wouldn’t have to worry about dates, for instance, if the Web Browser was supposed to convert them to ISO format before submitting them. If they failed to parse through the framework, then that becomes not-my-problem anymore.

Regrettably, we’re so far gone at this point that developers are going to be forced to deal with these sorts of issues for as long as the Web as it exists today persists. Many users simply don’t see the need to keep their browsers up to date. Most Web Developers will still support IE6. It’s ironic that many like to tout the Web as the best mechanism to provide a homogeneous experience to users when there is so little commonality between the major browsers. CSS and JavaScript both have to be tweaked based on the browser that is being used by the user, creating an unacceptable challenge for developers. Flash and Silverlight mitigate this by running in sandboxes where they are able to provide far more control over the execution space.

Since the Web is not strongly typed, should we use strongly-typed languages to process it? Of that, I’m not so certain. Most of my web-experience has been with dynamic languages like PHP and Perl. Frankly, I’ve always liked that experience. I don’t have to worry about catching exceptions or other error conditions during the program setup, but I do have to do a lot more work on the back-end to verify the data. Lately, I’ve begun working in the ASP.NET MVC Framework, writing C# code. The impression that I’ve been given in the work I’ve done so far is that rather than describing expected form data, the data is simply placed in a dictionary of strings, dropping all the benefits of type-checking up front, and requiring me to do a lot of type-casting that feels strange.

What we need is a strongly typed language for the web. In this system, on our form callbacks, we would define the properties (Get and/or Post) that we were accepting from the user. The user could send more data, but we would simply ignore it and not make it accessible to the rest of the package. This gives us strong-typing up front. The challenge is how we deal with invalid form information. For instance, if we’re expecting a numeric value, and receive an alpha string, what should the runtime do? I’m not sure what the answer is for this. I would consider registering callbacks with the arguments that would define the failure conditions. These callbacks could perform further attempts to parse a value, or simply provide a mechanism to hijack the control flow so that you can do error reporting. Incidentally, a standard MVC framework works really well for this sort of circumstance, as the error handler could simply pass control off to an “Error View”, in a manner that is seamless to the user.
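To show roughly what I have in mind, here is a sketch in Python (not a strongly typed language, but it keeps the idea short). Everything in it is hypothetical: `FormSpec`, `accept` and `first_token_as_int` are names invented for the example, not part of any framework. Fields are declared up front with a parser, undeclared fields are dropped, and a per-field callback gets a second chance at a value that failed to parse.

```python
from datetime import date

class FormSpec:
    """Hypothetical up-front declaration of the fields a handler accepts."""

    def __init__(self):
        self._fields = {}

    def accept(self, name, parser, on_error=None):
        """Register a field name, its parser, and an optional failure callback."""
        self._fields[name] = (parser, on_error)
        return self

    def parse(self, raw):
        """Return (values, errors); anything not declared is ignored."""
        values, errors = {}, {}
        for name, (parser, on_error) in self._fields.items():
            text = raw.get(name)
            try:
                if text is None:
                    raise ValueError("missing field")
                values[name] = parser(text)
            except ValueError as exc:
                recovered = False
                if on_error is not None:
                    try:
                        values[name] = on_error(text)
                        recovered = True
                    except ValueError:
                        pass
                if not recovered:
                    errors[name] = str(exc)
        return values, errors

def first_token_as_int(text):
    """Failure callback: retry with only the first whitespace-separated token."""
    tokens = (text or "").split()
    if not tokens:
        raise ValueError("no value supplied")
    return int(tokens[0])

spec = (FormSpec()
        .accept("quantity", int, on_error=first_token_as_int)
        .accept("ship_date", date.fromisoformat))

posted = {"quantity": "3 units", "ship_date": "2008-03-15", "is_admin": "1"}
values, errors = spec.parse(posted)
print(values)  # {'quantity': 3, 'ship_date': datetime.date(2008, 3, 15)}
print(errors)  # {}
```

A non-empty `errors` dictionary is the point where an MVC controller could hand control to an error view instead of continuing.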

It would not be particularly difficult to implement such a system; the only difference would be that before any arguments were parsed, something would need to register the types and names first. This would likely be easy to put into the ASP.NET MVC framework (you could do this in Catalyst as well, but Perl isn’t strongly typed, so why bother?), as Microsoft has done a very good job so far of making their framework extensible. However, until we have good standards on the client side for standard data representations, I think the entire discussion is largely academic.

The web wasn’t designed for applications. And while HTML 5 is trying to correct some of these limitations, it will be hard to support, as developers won’t necessarily be able to depend on their users using a compliant browser, and the attitude in the web is one of inclusion. That is as it should be: denying users is always a risky proposition, as people who have negative experiences are far more vocal than those who have positive ones. Choosing not to support a browser any more is dangerous, even if that browser is five years old. Some people just don’t like to change. Hopefully some day we’ll be ready to move forward with the web.

Foolish Language Decisions

I’ve been blogging quite a bit lately about C#, .NET, and Mono. And by and large, I love the technology. ASP.NET was never that interesting, as it feels pretty restrictive to guys like me who know HTML fairly well, but the rest of .NET is a solid technology that performs far better than alternative VMs at this time. With the DLR, it even does dynamic languages, cleanly providing the features of Parrot but combining them with non-dynamic code in an interesting symbiosis.

I even really like C#. It’s a language with features resembling the advanced features of C++, with the few nice features that came out of Java. The new features added in C# 2.0 and 3.0 make the language even better, though they do show that Paul Graham was right when he said that as languages advance, they become more like Lisp. Lambda expressions, Delegates…it won’t be too much longer before we get an eval command in C#. Won’t that be an interesting day?

Still, despite the improvements, there are a few things about C# that just don’t make sense.

First, the switch-case statement. Unlike C/C++, case statements do not implicitly fall through to the next case. Each non-empty case must end with an explicit jump, most commonly a break or a goto to another case label; simply running off the end of a case is a compile error. The break, which could easily be implicit under the rules of the grammar, was specifically made explicit. The argument seems to be that they didn’t want to give C/C++ programmers the wrong idea when they were reading C# code. I think that the average programmer is clever enough to remember such a trivial (and logical) change between two languages with minimal confusion. Break certainly shouldn’t be disallowed, but there doesn’t seem to be any good syntactic reason to require it.

For the second, C# supports the ‘static’ keyword on classes (and has since 2.0), which is used to declare that all the members of the given class are static. Why then, do I still need to declare each method as static? Clearly, by declaring the class static, I’ve already declared my clear intent that every member of that class be static as well. It’s a waste of time to require that I consequently declare every last member of the class as static as well. Just stupid.

Overall, C# is a really good language. It’s not the most powerful language out there (Perl has that by miles, and of course Lisp, but I don’t use it regularly), but it’s good. These few nagging things really are little more than that, nags. Still, there is no good reason why these syntactic artifacts have persisted into this language, and specifically its third revision.

Boise Code Camp 2008

This last weekend I got the opportunity to travel down to Boise, ID for their third annual Code Camp. Code Camps are a relatively new activity, largely supported by Microsoft (though non-Microsoft talks are welcome). It’s an opportunity to spend a day attending sessions given by software people, for software people, so that we all can take away something to become better coders.

I traveled down to Boise with a few people from work, which allowed us to split up and watch a bunch of different sessions. We also took a few cameras so that we could record some of what we saw, both for our own records, and possibly to make available for others.

I was happy with most of the sessions I attended. Scott Hanselman’s keynote was fantastic (no video from us, hopefully someone got some). Not very informative, but it wasn’t really meant to be, and it was really, really funny. Almost as funny was the tripod we were able to almost magically provide when they were having trouble getting a projector set up so that everyone could see it.

After the Keynote, I went to a talk on using Adobe’s Flex framework, and how one company is using it and Perl to quickly write applications for a variety of customers. Flex is a RAD system for developing Flash applications using an XML format, to provide a richer application experience through the web browser than AJAX can at this point. And, with Adobe’s new AIR, the same application designed for the web can be built for the desktop, even with support for a local data store. It’s an interesting platform, and one that I may need to investigate more.

My next session was on Testing with Rhino.Mocks, a framework that allows you to easily “mock” objects, as well as configure what expected usage is, in order to test all your assumptions of how a certain part of a program will be used. The platform looks interesting, though an hour was barely long enough to get a good look at the system. Still, as I move closer toward writing unit tests first and code second, it looks like an interesting tool.

My next two sessions were terrible. I went to a session on MVC by a guy who designs web-based software for the government. The talk was really, really lousy. The presenter seemed as if he was going to drop dead from a stroke at any moment, and never really seemed to get to the point of what MVC was, why we would want to use it, and what benefits it held. I spent an hour in this room, and got nothing out of it. Then, I went to a session on Service Oriented Architecture, which I quickly realized I understood pretty well, and the presenter had no idea how to present what he was trying to talk about.

The next few hours were more useful. I attended a session on Workflows in Sharepoint, which also provided a decent introduction into Windows Workflow. It’s a technology we’re planning to use a bit at work, and it appears to be an interesting mechanism to track a piece of work through a series of steps until it reaches a completion state. The technology is interesting, though I fear that it might be a little too restrictive. Whether the trade-off in restrictions is worth the time saved, I haven’t determined yet, but it will be interesting to pursue.

With sessions on C# 3.0 and .NET 3.5 and on Practical Hashing (which was heavily password focused, and not as meaty as I’d hoped) to wrap things up, the day ended. It was an interesting day, though I didn’t walk out of it with as much as I’d wanted. The technology was very Microsoft-centric (though it didn’t need to be), though several of the bits of technology will work in Linux development too.

I’m still kind of tired from the trip, but I hope to talk a bit more about some of the technologies we learned about. Rhino.Mocks in particular will be next on my to-learn list, and I’ve already been working on MVC systems.

Memory Attacks Addendum

Back in 2006, Adam Boileau, a New Zealand security researcher and consultant, announced an attack that let a Linux computer read another computer’s memory over Firewire. His initial attack was targeted toward Windows, and allowed an attacker to unlock a locked system or bypass a login prompt. After two years, the attack still works, and Boileau decided to post the code for the attack.

While Boileau only approaches this from the direction of unlocking a locked system, this would work equally well as an alternative to the memory swap trick discussed recently. However, in this case, almost every Operating System is equally to blame for this, as it is the OHCI-1394 Specification which dictates this as acceptable behavior.

Firewire devices are supposed to be given direct memory access in order to ensure good performance, which is an admirable goal. Firewire devices are designed to move an enormous amount of data, and people want it to move quickly. Unfortunately, spoofing a valid device in order to get full illegitimate access to a system’s memory is trivial.

Boileau was not the first person to approach this danger of Firewire; in fact, Linux, Macs and BSD were attacked successfully this way before Windows was, due to a technical issue (basically, only Windows required you to claim that you were allowed DMA access, a trivial thing). Still, why was Firewire designed this way in the first place? And why do the majority of users, who don’t even use Firewire, have this dangerous behaviour enabled by default?

All I can figure is that the developers of Firewire were hardware geeks more concerned with high performance than with security. Provide DMA access, sure, but relegate it to a negotiated size and span of memory, not completely unrestricted access to everything. Hell, the OS and CPU aren’t even involved in this transfer. Mac firmware takes advantage of this by allowing you to boot a Mac into a special “disk” mode where it acts just like a Firewire disk, great for when you need to get data off a Mac whose hardware is starting to fail.

While this is a great forensic tool, since Memory images are typically hard to get from a running system, it’s also an interesting security risk. On the one hand, it’s just another example of physical access meaning complete access, but it’s also one of the most effective data gathering techniques I’ve seen, in that it only requires plugging in a simple cable.

These tools are already a part of my incident response toolkit, and I suspect they will remain so for some time.

Microsoft's Techfest

The online buzz is being drowned out by MiX, where Microsoft has made some not-terribly-surprising announcements (IE8 and Silverlight 2 Betas), but also occurring recently is Microsoft’s Techfest, an internal show Microsoft puts on where their researchers get to show off the cool stuff that they’re working on. Channel 9 had an opportunity to record some stuff with a few of these teams. And some of the work was pretty cool.

One team is working on a music suggestion system that actually ‘listens’ to the music to determine what kind of music it is, in order to work out what similar music a user might find interesting. It’s really interesting AI research that completely removes the need to have people tag music. In fact, after the initial training of the AI is done, human intervention basically becomes unnecessary. For streaming music services, this is a potentially huge cost-savings development, as well as a system that could greatly help the exposure of new artists, by making the suggestions coming out of these services more reliable and useful.

One team has done a lot of work on Field Programmable Gate Arrays, which could help processor design immensely, as this is a field that has sort of stagnated in recent years. Another has built a series of household automation sensors using SOAP-based communication and Wireless TCP/IP that will run on two AA batteries for 2+ years. As someone who’s given a lot of thought to household automation with something like MisterHouse, I think this appears to be an excellent alternative to expensive and flaky X10. I just hope it’s ready soon.

There is some interesting research occurring to try to make it possible to generate statistical data for projects like the Netflix Prize, without the almost guaranteed disclosure of personally identifiable information. The system is based on MS SQL Server (of course), and it exposes an API where people can submit requests to you, and returns the results of the query, sans any identifiable information. So, Netflix wouldn’t have had to give everyone a “scrubbed” database of actual user preferences which could be linked against IMDB, rather people would submit their queries, and get the statistical data they needed back cleanly. For certain types of queries, for instance for the Locality queries supported in SQL Server 2008, it will institute a bit of a ‘fudge factor’ to try to reduce the possibility of revealing information. For instance, if you were Amazon, and you wanted to see the shipping addresses of all the DVDs sold, for localities where the data is sparse, the data returned will not be as accurate (possibly township level instead of street level) to prevent inadvertent disclosure. I’m not sure how well it will all work yet, but the system is definitely interesting.
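I don’t know how the research system actually implements its ‘fudge factor’, so what follows is only my own guess at the flavor of it, sketched in Python. The function name, the `(township, street)` row shape and the `min_count` threshold are all assumptions. Streets with too few rows are reported at the township level instead.

```python
from collections import Counter

def shipments_by_locality(rows, min_count=5):
    """Count (township, street) rows, coarsening sparse streets to the township.

    min_count is an assumed threshold, not something taken from the research.
    """
    per_street = Counter(rows)
    result = Counter()
    for (township, street), count in per_street.items():
        if count >= min_count:
            result[(township, street)] += count
        else:
            # Too few rows to report safely: fold them into the township.
            result[(township, None)] += count
    return dict(result)

rows = ([("Moscow", "Main St")] * 6
        + [("Moscow", "Elm St")]
        + [("Moscow", "Oak Ave")] * 2)
print(shipments_by_locality(rows))
# {('Moscow', 'Main St'): 6, ('Moscow', None): 3}
```

This is nowhere near a real privacy guarantee (the township bucket can itself be tiny), but it shows why the answers get less precise exactly where the data is thin.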

Finally, there are some collaborative search technologies that Microsoft is working on, including one where you can store your search data (and notes on searches) on a Microsoft service, and bring in other people to participate in the searches with you, including the ability to see what searches they’ve done, what sites they’ve visited, and notes they’ve made. Another system allows people to share a single computer to do research, using multiple mice and cellphones as the input devices. Of course, these applications only work with IE and Microsoft’s Live search (currently), so even when Collaborative search is made available, I won’t be able to use it. But the idea is an interesting one, and it would be possible to create an interesting mash-up that did the same thing in a browser and search engine agnostic manner.

As I’ve said recently, Microsoft research seems interesting, and while I’m not sure about the way some of these technologies will be made available, or even if they’re really viable, the support Microsoft has provided its research teams to go out and create is phenomenal. I wish Techfest had received more coverage, because I’m sure there was plenty more that was interesting happening, and that probably would have been more interesting than yet another demo of Silverlight 2 and Expression Blend studio (with apologies to ScottGu).

Miguel de Icaza on Channel 9

Channel 9 is a relatively new website from Microsoft that features webcasts of interviews from the trenches at Microsoft, ranging from exposés of individual employees to discussions of new technologies from Microsoft to general discussions of Computing theory. It’s all very interesting, and it shows that while I and others may distrust Microsoft for their business practices, the people creating the technology are really passionate about it. And Microsoft is much more willing than they used to be to release new technologies, like F#, as open technology.

Channel 9 has been an entertaining resource to listen to while I work, and see some of what’s coming from Microsoft without a lot of the marketing angle that you get from other sources of Microsoft information. Incidentally, Microsoft has also formed a new site, Port 25, which focuses on Open Source issues and Microsoft.

When Miguel de Icaza, founding member of the GNOME and Mono Projects, was at the Lang.NET 2008 Conference in Redmond, he was asked to appear on Channel 9 to talk about Open Source, Mono, and Moonlight. In the interview, which is alongside Dragos Manolescu (a member of Microsoft Live Labs), Miguel talks about what Open Source means, why so many people in the Open Source world dislike Microsoft, and why their feelings are perhaps misplaced.

While I tend to disagree with Miguel when it comes to Politics, and I was really, really leery when he founded Mono, I’ve always respected his ability as a hacker and as a leader of open projects, and over time, I came to quite like .NET and Mono. He holds himself well, and despite his seeming inability to remember a lot of the pop-culture things he wants to reference, behaves sensibly and intelligently. I love Open Source Software, and I see the value in GNU and the GPL. But the work of the Free Software Foundation is often overshadowed by the zealotry inherent in many of its members, which simply isn’t shared by most of the community.

Most of us prefer Free/Open Source software, and will seek to use it over proprietary software, but as Miguel says, Proprietary software has a place, and it’s really great to see companies like Microsoft beginning to investigate and even embrace open technologies and philosophies. Sure, I still don’t trust them entirely (check out my comments on OOXML), but it’s great to see how the folks in the trenches are really just there for the technology, and that management is once again supporting the teams who are just interested in hacking.

Movable Type Development

In my continued efforts to move my personal site and build my Consulting site in Movable Type, I found that the way that Movable Type handles Pages was problematic for me. Namely, I wanted to be able to have index pages for folders.

The system does not prevent me from placing a page with a basename of “index” in a specific folder, which provides the ‘illusion’ of folder indices. However, that solution left me with a problem. Namely, the index page appeared as a subpage of the folder, not as the folder index.

I quickly discovered that the Tags I needed to rewrite the Pages Widget to support the behavior I wanted didn’t exist, and so I began development on the Rich Folders Plugin for Movable Type 4. Currently, the plugin consists of only two more tags, one to determine if the current folder has an entry named “index”, the second to tell me if the current page has a basename of “index”. The next version of this plugin will support other basenames that might be encountered (such as “homepage”). I am debating the need to make this list editable by users of the plugin.

While developing this plugin has got me part of the way to completing the move of my personal site to MT, it has forced me to become far more familiar with the internals of Movable Type than I expected to need for such a simple plugin. Looking back, I suppose my shock during the process wasn’t how much I needed to learn about Movable Type, but what felt like a general lack in the developer documentation.

Never in the Developer Documentation does the author of that document go into important details, like how to load a list of entries, or what you can expect in the system Stash at different points in the execution cycle, or how to return errors to the user if your plugin is used incorrectly. These reasons are part of why I have not yet published my plugin, because I am still going through ensuring it is more robust and ready for general consumption. Plus, I think it could use some more functionality, much of which will likely become apparent the more work I get done.

It seems to me that Six Apart depends on a healthy developer ecosystem around the Movable Type publishing platform. Movable Type is an amazingly flexible system, in that the framework can be extended to do a number of different things, from Blogs, to standard websites, to Forums. However, that flexibility is very hidden. The standard installation is highly blog-centric, and I’ve had to dig rather deep into this CMS in order to move a relatively simple website to it, which is unacceptable.

For instance, there does not appear to be any way to add static content to the root of the site without editing the index template. This deprives me (or any user) of the ability to use the rich text editor to maintain that data, instead hiding that particular page away in the Templates subsystem, and forcing me to edit that HTML directly. What is to be done? I’m expecting that the next part of Rich Folders will be a function that can grab the contents of the index in the blog’s root to display that to the users.

Ultimately, I’m likely to change the name of Rich Folders, as really my goal with this plugin is to expose functions that make Movable Type easier to use as a simple CMS. Movable Type is a great platform. I love that I can have a website which doesn’t need to destroy my hosting provider’s database in order to operate, but can still be maintained easily and from anywhere. If only the developer documentation were better, I’d likely already be done. I love Perl as a language, but I am forced to acknowledge that reading Perl code you didn’t write yourself is not unlike translating hieroglyphics without possessing a Rosetta Stone.