Bad Design: Microsoft Virtual Server

At work, we’ve been moving quickly in the direction of Server Virtualization. There have always been benefits to reducing the attack exposure of a server by minimizing the number of services running on it. Traditionally, though, giving each service its own machine has been immensely impractical: servers take up space and use quite a bit of electricity. And the fact has always been that most servers sit around mostly idle, most of the time. So, given that many servers usually aren’t working very hard, why not combine many servers into one? This is exactly what Server Virtualization is all about.

This space has traditionally been dominated by VMware, though several Open Source options, like Xen, have become available in recent years. On the desktop, I’ve been using VirtualBox OSE, which does a decent job, at least until I can justify the purchase of VMware Workstation (which, frankly, isn’t too expensive at ~$200). Between these Open Source tools, and Microsoft’s Virtual Server and Virtual PC both being free to licensed Windows users, it will be interesting to see how long VMware can keep their prices where they are.

To be fair, VMware is still the superior product, but the cost savings, and our SysAdmin’s tendency to prefer all things Microsoft, led us to use the Microsoft solution. And it’s worked out pretty well for us. Virtual Servers are easy to create, back up, and redeploy. The only problem we’ve had is that these operations take a long time because we’re not using a NAS device, but the software has done pretty well for us despite this. 10+ hours to transfer a 150 GB image over Gigabit Ethernet pretty much sucked, but it did work.
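To put that transfer time in perspective, here is a quick back-of-the-envelope calculation. It's only a sketch: I'm assuming the image is 150 gigabytes in decimal units, that the transfer took exactly 10 hours, and I'm ignoring protocol overhead on the wire. The function names are mine, made up for illustration.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_second


def effective_mbit_per_second(size_bytes, seconds):
    """Average throughput actually achieved, in megabits per second."""
    return size_bytes * 8 / seconds / 1e6


IMAGE_BYTES = 150 * 10**9   # assumption: 150 GB, decimal gigabytes
GIGABIT = 10**9             # nominal Gigabit Ethernet line rate, bits/s
OBSERVED_SECONDS = 10 * 3600  # assumption: exactly 10 hours

ideal = transfer_seconds(IMAGE_BYTES, GIGABIT)
observed = effective_mbit_per_second(IMAGE_BYTES, OBSERVED_SECONDS)

print(f"Ideal transfer time: {ideal / 60:.0f} minutes")
print(f"Observed throughput: {observed:.1f} Mbit/s")
print(f"Link utilization:    {observed * 1e6 / GIGABIT:.1%}")
```

Under those assumptions the wire itself could move the image in about 20 minutes, and our 10 hours works out to roughly 33 Mbit/s, or about 3% of the link. So the network probably wasn't the bottleneck, which fits with the disks being the real problem.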

So, why is this software being panned for Bad Design? Simple. Deploying a Virtual Server image out of the backup deletes the backup, without any option to do otherwise. Our SysAdmin has been in the process of rebuilding all of our servers with Windows Server 2008, and last weekend was his opportunity to rebuild the Virtual Server Hosts (I’m not sure why we’re not using Hyper-V, don’t ask). The rebuild went fine, aside from the re-deployment of some of the Virtual Servers being slow, but again, a NAS will fix this, and it’s an in-progress purchase.

Due to the staggering amount of time it took to do the redeployments, immediate backups were not performed. It was assumed that waiting until this weekend would be fine. Anyone who has done Systems Administration knows what happened next.

The Virtual Server Host lost the largest virtual server. It was the only one that wasn’t a part of any standard backup scheme, because we had been told that it was for temporary storage of image data as it was being cataloged. It was being used for more. Much more. Attempts were made to recover the images, including running several undelete tools on the server in question. The only thing not done was immediately taking the server offline and imaging the drive for analysis and possible recovery. The SysAdmin felt it wasn’t necessary, and I lost a much sought-after opportunity at forensic analysis. :(

Sure, we should have had that backup. However, if best practice dictates that you should immediately back up a virtual server deployed from the vault, why does the software delete the version it’s deploying? Who in the hell thought that was a good idea? Drives fail. Software fails. We back up to protect ourselves from that. It was entirely possible that the failure that lost the Virtual Machine could have happened in the window between the image being deployed and the backup being completed, and in that case, whose fault would the failure have been?

One of the first rules of writing software is that it must be resilient. Due to the decision to move an image while deploying from the library rather than copying it, that server was lost. This is not resilient programming. This is not resilient design. This is not resilient software.
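To show what I mean by copying rather than moving, here's a minimal sketch of a deploy step that leaves the library copy alone. To be clear, this is not how Virtual Server works internally and it doesn't use any Microsoft API; `deploy_image` and `sha256_of` are hypothetical names of my own, and I'm treating an image as a single plain file.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deploy_image(library_image, target_dir):
    """Copy an image out of the library and verify it. Never touch the original."""
    library_image = Path(library_image)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    deployed = target_dir / library_image.name
    shutil.copy2(library_image, deployed)  # copy, not move

    # Only trust the deployed copy once it matches the library image.
    if sha256_of(deployed) != sha256_of(library_image):
        deployed.unlink()
        raise IOError(f"Deployed copy of {library_image.name} failed verification")

    return deployed
```

The cost is extra disk space and a second read of the image for verification. In exchange, a failure at any point after deployment still leaves you with the library copy to fall back on, and removing it becomes a separate, deliberate decision.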

For the most part, our experience with Microsoft’s Virtualization technology has been fine. I had a bit of trouble booting Ubuntu 8.04 inside of it (tip: add the ‘noreplace-paravirt’ boot option), but it’s mostly worked pretty well. Still, this particular bug is so egregious that I can’t imagine what the person who decided on that behavior was thinking.

Don’t be completely afraid of Microsoft’s tech. But do be sure to do your backups religiously.