DTD Compliance on the Web: Worthwhile?

Posted on December 2nd, 2008

The World Wide Web Consortium (W3C) has spent well over a decade trying to ensure that there is a reasonably consistent means of delivering content via the Web. This work has included HTML, CSS, the Accessibility Guidelines, PNG, SOAP, the DOM, and much, much more. Today, I'm going to focus on the HTML and XHTML standards, as they are almost, but not quite, interchangeable, and both are in common use on today's web.

The two current kings of Web markup are HTML 4.01 and XHTML 1.0. Of course, it's not quite that simple, because each of these defines its own 'flavors' (Strict, Transitional, and Frameset), so two XHTML pages, for instance, can differ slightly in what they allow. Most people, myself included, tend to favor the "Transitional" models, as they're a tiny bit more flexible, given that they support elements that the W3C wants to phase out (in favor of CSS). I completely support the W3C in this, but I occasionally use modules that don't follow the Strict Document Type Definition, so I tend to stick to Transitional.

Because pages can use different document types, each page should tell the browser which one it is using, in order to make the browser's job of rendering easier (and hopefully improve the likelihood that the browser will render the page correctly). It does so with a document type declaration that names a Document Type Definition (DTD), which looks as follows:

        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This declaration, which is the same one at the top of all these blog posts, tells the browser that I'm sending it XHTML 1.0 Transitional, and provides a link to the DTD, which can be used to validate my XHTML. The question is, is it important that your documents be valid and well-formed?
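
For the curious, an XHTML 1.0 Transitional declaration breaks down into three pieces: the root element, a public identifier naming the document type, and the URL of the DTD itself. Here is a rough sketch that pulls those pieces apart. The declaration string is the standard one published by the W3C; `parseDoctype` is a name I made up for illustration, and it only handles the PUBLIC form.

```javascript
// The standard XHTML 1.0 Transitional declaration published by the W3C.
var doctype = '<!DOCTYPE html PUBLIC ' +
    '"-//W3C//DTD XHTML 1.0 Transitional//EN" ' +
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';

// Hypothetical helper: split a PUBLIC declaration into its three parts.
// Returns null for anything that doesn't match that form.
function parseDoctype(text) {
    var m = /^<!DOCTYPE\s+(\S+)\s+PUBLIC\s+"([^"]*)"\s+"([^"]*)"\s*>$/.exec(text);
    if (!m) {
        return null;
    }
    return { rootElement: m[1], publicId: m[2], systemId: m[3] };
}

var parts = parseDoctype(doctype);
```

The `publicId` is what identifies the flavor (Transitional here), while the `systemId` is the link a validator can follow to fetch the DTD.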

Purists, of which I consider myself one, argue that yes, it is. Well-formed data is easier to parse, faster to parse, and easier to present in a correct manner. The problem is that not every browser agrees on what the 'correct manner' is. This is improving, but Internet Explorer in particular will occasionally take well-formed HTML and mangle it in some pretty extraordinary ways. To be fair to Microsoft and IE, these days the problems are generally with the CSS engine and not the HTML parser, and IE is not the only offender, but it's still a problem that Web developers constantly have to deal with.

The biggest problem, really, is that for many years people have been presenting browsers with poorly formed HTML, and rather than the pages being fixed (the easy solution), browser makers went to incredible lengths to work around these strange bugs. After all, as far as the user is concerned, it's the browser's fault when something doesn't render correctly, not the markup's. This attitude is changing, but poor standards compliance and a multitude of custom extensions have made the job of the web developer far more difficult, given the amount of testing needed to ensure a site renders correctly (let alone that its JavaScript works as expected).

Because of years of legacy sites that won't be updated but that people still expect to work, web browsers aren't likely to stop guessing at what malformed HTML means anytime soon, nor should they. But I believe it is in our best interest to build the most standards-compliant markup we can: first, because a browser has less work to do when it doesn't have to guess, and second, because HTML is first and foremost a data format, and I want to see data well-formed.

On the other hand, there are sometimes reasons to consider not following the standards. For instance, on this site, I use the dp.SyntaxHighlighter JavaScript widget to do code highlighting. It works really well, and I like it. Unfortunately for me (anal retentive bastard that I am), it doesn't follow the XHTML 1.0 standard. Specifically, it uses the 'name' attribute on either the 'pre' or 'textarea' tags to define the areas it should operate on. Unfortunately, the 'name' attribute is not valid on the 'pre' tag, and while I want to style the textareas, I don't necessarily want to style all textareas the same.

What I wanted, in truth, was to move the value I was specifying in 'name' into the 'class'. It's fully possible to define multiple classes on a given tag, so I went ahead and started hacking it. The code was fairly simple:

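        // Returns true if the element's class list contains cls as a whole word.
        function hasClass(ele, cls) {
            return new RegExp('(\\s|^)' + cls + '(\\s|$)').test(ele.className);
        }

        // Appends to list every tagName element whose 'name' attribute or class matches name.
        function FindTagsByName(list, name, tagName)
        {
            var tags = document.getElementsByTagName(tagName);
            for(var i = 0; i < tags.length; i++)
                if(tags[i].getAttribute('name') == name || hasClass(tags[i], name))
                    list.push(tags[i]);
        }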
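
One caveat with building a regular expression from the class name: a name containing characters that are special in a regex (a '.' for instance) would be treated as a pattern rather than literal text. A minimal sketch of an alternative that splits the class string on whitespace instead is below. `hasClassToken` is a hypothetical name, not part of dp.SyntaxHighlighter, and it takes the class string rather than the element so that it can be tried outside a browser.

```javascript
// Hypothetical helper: true if classString contains cls as a whole,
// whitespace-separated token. No regular expression is built from cls.
function hasClassToken(classString, cls) {
    var tokens = (classString || '').split(/\s+/);
    for (var i = 0; i < tokens.length; i++) {
        if (tokens[i] === cls) {
            return true;
        }
    }
    return false;
}
```

In the loop, the call would become something like `hasClassToken(tags[i].className, name)`.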

I'm considering rewriting this code using YUI 3, since I already use YUI and could cut the code complexity quite a bit by refactoring it, but it's probably not worth the time that would take. My change above is simple, just checking the class in addition to the name attribute, but I also had to filter out the class name below in order to ensure that the correct brush is called:

                options = options.split(' ');
                options = (options[0] === name ? options[1] : options[0]);
            options = options.split(':');

It's a bit naive, particularly since it only allows two classes (or rather, it assumes that the language/arguments class is the first or second class), which I plan to correct later. But it is working, as you can see on this page.
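
One way the two-class limit might be lifted is to scan every class and pick the one whose first ':'-separated part is a language the highlighter knows about. This is only a sketch under a few assumptions: `parseOptions` and `knownLanguages` are names I've invented here, and the list of languages would have to come from whatever brushes are actually registered.

```javascript
// Hypothetical helper: find the language/arguments class anywhere in the
// class string and return it split on ':'. name is the marker class
// (e.g. 'code'); knownLanguages is an assumed array of brush aliases.
function parseOptions(classString, name, knownLanguages) {
    var tokens = (classString || '').split(/\s+/);
    for (var i = 0; i < tokens.length; i++) {
        // Skip empty strings and the marker class itself.
        if (tokens[i] === '' || tokens[i] === name) {
            continue;
        }
        var parts = tokens[i].split(':');
        for (var j = 0; j < knownLanguages.length; j++) {
            if (parts[0] === knownLanguages[j]) {
                return parts;
            }
        }
    }
    return null; // no language class found
}
```

With this, a class such as `wide code javascript:nogutter` would work regardless of where the extra styling classes sit, at the cost of needing the language list up front.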

Was this worthwhile? Perhaps not. This is an example of a time where the DTD is perhaps too limiting. Is it really necessary to prohibit me from using the 'name' attribute on a 'pre' tag? Maybe, maybe not. I'll leave that up to you.

Unfortunately, we are at a time when the HTML standards are woefully out of date. XHTML 1.0 was last revised in 2002, and HTML 4.01 in 1999. Sure, HTML 5 is in active development, but sometimes the question needs to be asked: is a given feature worth giving up standards compliance? For some modern JavaScript/DOM/AJAX-based applications, that answer might occasionally be yes. But that doesn't mean that standards compliance shouldn't be the goal, with well-documented reasons for any decision not to follow it.

If you're interested in my updated version of SyntaxHighlighter, and the XHTML 1.0 Transitional compliance it allows, please check out the class-support branch of my github repository for the project. All my changes are released, as was the original code, under the GPL.