LINQ and Premature Optimization

s a Man, the word ‘Premature’ sends shudders down my spine. As a Developer, it does as well, for additional reasons. Every student of computer programming has been taught about optimization at some point. Whether it’s been learning assembler for the hand-tuning of certain methods, to moving variable declarations outside of loops to save cycles, we’ve all been taught it. And at least for the last several decades, we’ve had the words of Donald Knuth from the December 1974 issue of the ACM Journal Computing Surveys thrown at us.

Premature optimization is the root of all evil.

And he’s right (maybe). And a lot of people have pointed that out. Luckily, we live in a day where compilers can often do a better job of optimizing our code than we can. These days, if we’re having a performance problem, we probably have a problem in our algorithm, something that requires changing (not tweaking) the algorithm. Note: I’m not including performance problems relating to bugs here, that’s a different problem.

Unfortunately, sometimes optimization done by the compiler can be premature and broken. I have encountered such cases in LINQ, and needless to say, I’m not terribly pleased by it.

In our database, we have our permissions split up among three different tables (representing the three different types of permissions). I’m not going to argue about whether or not this is the best way to have done this, I’m probably going to be changing this behavior, but it’s what I have right now, and that small refactor isn’t important enough to change right this instant. It did, however, lead to the following problem. But first, the schema:

DBSchema.png

As you can see, each set of permissions requires just a bit more data to qualify it. However, I found (at least) one case where I wanted to combine the records from each of these tables into a single return from a web service, using the following structure:

    public class PermissionRecord
    {
        public UserRecord User { get; set; }
        public PermissionType Type { get; set; }
        public String Campus { get; set; }
        public int Year { get; set; }
        public int Term { get; set; }
        public string Prefix { get; set; }
        public int? MinCourseNumber { get; set; }
        public int? MaxCourseNumber { get; set; }
    }

For those unfamiliar with newer versions of C#, the ‘?’ after a variable type means that the type is ‘null-able’. Obviously this only works on primitives, which couldn’t be null before. Anyway, back onto the code. I wanted to fill in a List of these with records from PermissionsPrefix and PermissionsCourse. But, as is shown in the schema above, MinCourseNumber and MaxCourseNumber don’t exists is Permissions_Course and should be set to null. Sounds easy enough, right? That’s what I thought, which led to the writing of the following LINQ query.

(from t in Permissions_Prefixes
 join u in UserNames on t.UserWsuId equals u.WsuId
 where t.Campus == "Pullman" && t.Year == 2009 && t.Term == 1 && t.Prefix == "A S"
 select new PermissionInfo()
 {
    Campus = "Pullman",
    Year = 2009,
    Term = 1,
    Type = 4,
    Prefix = "A S",
    MaxCourseNumber = null,
    MinCourseNumber = null,
    User = new
    {
    UserId = t.UserWsuId,
    FullName = u.FullName,
    EMail = u.EMail
    }
    }).Union(from t in Permissions_Courses
                     join u in UserNames on t.UserWsuId equals u.WsuId
                     where t.Campus == "Pullman" && t.Year == 2009 && t.Term ==  && t.Prefix == "A S"
                     select new PermissionInfo()
                     {
                        Campus = "Pullman",
                        Year = 2009,
                        Term = 1,
                        Type = 8,
                        Prefix = "A S",
                        MaxCourseNumber = t.MaxNumber,
                        MinCourseNumber = t.MinNumber,
                        User = new
                        {
                            UserId = t.UserWsuId,
                            FullName = u.FullName,
                            EMail = u.EMail
                        }
                    })

Fairly straight forward union. Request all the relevant permissions out of PermissionsPrefix (setting the Mix and Max course values to null), and union them with the relevant records from PermissionsCourses. This looked to me like it should work, no problem. Unfortunately, it produces the following T-SQL statement:

SELECT [t4].[value] AS [Campus], [t4].[value2] AS [Year], [t4].[value3] AS [Term],
            [t4].[value4] AS [Type], [t4].[value5] AS [Prefix], 
            [t4].[value6] AS [MaxCourseNumber], [t4].[value6] AS [MinCourseNumber], 
            [t4].[UserWsuId] AS [UserId], [t4].[FullName], [t4].[EMail]
FROM (
    SELECT @p4 AS [value], @p5 AS [value2], @p6 AS [value3], @p7 AS [value4], 
                @p8 AS [value5], @p9 AS [value6], [t0].[UserWsuId], [t1].[FullName], 
                [t1].[EMail]
    FROM [Permissions_Prefix] AS [t0]
    INNER JOIN [UserNames] AS [t1] ON ([t0].[UserWsuId]) = [t1].[WsuId]
    WHERE ([t0].[Campus] = @p0) AND ([t0].[Year] = @p1) AND ([t0].[Term] = @p2) AND ([t0].[Prefix] = @p3)
    UNION
    SELECT @p14 AS [value], @p15 AS [value2], @p16 AS [value3], @p17 AS [value4], 
                @p18 AS [value5], [t2].[MaxNumber], [t2].[MinNumber], [t2].[UserWsuId], 
                [t3].[FullName], [t3].[EMail]
    FROM [Permissions_Course] AS [t2]
    INNER JOIN [UserNames] AS [t3] ON ([t2].[UserWsuId]) = [t3].[WsuId]
    WHERE ([t2].[Campus] = @p10) AND ([t2].[Year] = @p11) AND ([t2].[Term] = @p12) AND ([t2].[Prefix] = @p13)
    ) AS [t4]

Take a close look at that query. Notice how in the outer select both MaxCourseNumber and MinCourseNumber are being set to value6? And that the first union select, has 9 items and the second has 10? Needless to say, this doesn’t work. Since both Min and Max Course Number are being set to the same value, LINQ decides to try to optimize the select by only selecting the value once and assigning the value twice. Oh, easy fix, I hear you say, just switch the two queries, and put the request from Permissions_Course first!

SELECT [t4].[value] AS [Campus], [t4].[value2] AS [Year], [t4].[value3] AS [Term],
    [t4].[value4] AS [Type], [t4].[value5] AS [Prefix], 
    [t4].[MaxNumber] AS [MaxCourseNumber], [t4].[MinNumber] AS [MinCourseNumber], 
    [t4].[UserWsuId] AS [UserId], [t4].[FullName], [t4].[EMail]
FROM (
    SELECT @p4 AS [value], @p5 AS [value2], @p6 AS [value3], @p7 AS [value4], 
        @p8 AS [value5], [t0].[MaxNumber], [t0].[MinNumber], [t0].[UserWsuId], 
        [t1].[FullName], [t1].[EMail]
    FROM [Permissions_Course] AS [t0]
    INNER JOIN [UserNames] AS [t1] ON ([t0].[UserWsuId]) = [t1].[WsuId]
    WHERE ([t0].[Campus] = @p0) AND ([t0].[Year] = @p1) AND ([t0].[Term] = @p2) AND ([t0].[Prefix] = @p3)
    UNION
    SELECT @p13 AS [value], @p14 AS [value2], @p15 AS [value3], @p16 AS [value4], 
        @p17 AS [value5], @p18 AS [value6], [t2].[UserWsuId], [t3].[FullName], 
        [t3].[EMail]
    FROM [Permissions_Prefix] AS [t2]
    INNER JOIN [UserNames] AS [t3] ON ([t2].[UserWsuId]) = [t3].[WsuId]
    WHERE ([t2].[Campus] = @p9) AND ([t2].[Year] = @p10) AND ([t2].[Term] = @p11) AND ([t2].[Prefix] = @p12)
    ) AS [t4]

No dice. The outer select is correct now, but the second inner select (from PermissionsPrefix) still isn’t selecting value6 twice, and it’s saving it as value6, when clearly the outer select is expecting properties named MaxNumber and MinNumber. So, these unions are completely broken because LINQ is trying to optimize them independently of one another, which clearly doesn’t work. Not only that, but the optimizations are only half done. They’ll gladly try to combine values in the select part of the statement (if I were to change the ‘null’ to ‘1’ in the query against PermissionsPrefix, it would combine it with the Term value), but they insist on sending in multiple parameters with the exact same values in other parts of the query. For instance, the value for Term is the same in both the Where clause and the select clause, but it’s clear that the SQL generated is sending the parameter in with two different names. This happens in the same query, not just across the two queries.

Why is this such a big deal? Simply because I now am forced to separate this out into two distinct LINQ queries, which each ToList() inside of .NET. Once I’ve done that, I have to then append the results of the second query onto the first list, resort the list how I want it, and dump the data back across the wire. Essentially, I have to redo in .NET a lot of things that SQL is specially tuned to do. Now, in this case, the number of records we’re dealing with is trivially small, and as we know, everything is fast for small n.

I fail to understand why Microsoft has built some optimizations into the LINQ processor and not others. Admittedly there are times when optimization is difficult because it’s not always clear when you should optimize the SQL. Sometimes (as in the query above) you can optimize ahead of time. Other times, you’ll need to wait until runtime, but that can always be determined by code-analysis. And if you’re not sure, you should always err on the side of caution, and not combine my select elements simply because a cursory glance suggests it might work, because there is a very good chance it won’t.

Premature Optimization may be the root of all evil, as it can lead to a large amount of wasted developer time, but as a developer I should be able to trust my compiler to not change my code in a breaking fashion. LINQ is a great tool, but it still has some issues I really hope Microsoft corrects for .NET 4.0.