AT&T Data Mining Language - Hancock

AT&T has been lambasted quite a bit lately for their willing cooperation with Federal Law Enforcement in Data Mining efforts. Ignoring, for the time being, the moral and issues surrounding Data Mining against mostly innocent people for the purposes of determining Communities of Interest, there is actually quite a large amount of fascinating research that has gone in to this problem space. The initial problem of determining Communities of Interest by tracking who is communicating with whom, is deeply involved in Graph Theory, and I'll be waiting on giving my impressions of that research until a later date. Right now, I'm interested in talking about the programming language that AT&T designed to aid in examining the enormous amounts of data they have access to.

Hancock is a domain specific language designed by AT&T to be an extension to C. It's not too surprising that AT&T chose to base their new language off of C, after all they invented it. Hancock is a stream-oriented language designed for iterating over vast amounts of regular data, based on patterns. Due to it's requirement that data be fairly regular, it's particularly well suited for analyzing information in database, or potentially some log files. While AT&T is using this language to analyze their long distance phone records, internet traffic, and suggest analyzing banking data, I'm going to approach the technology from a significantly less sinister angle, of an e-commerce company who is analyzing what their customers are viewing on their site, for the purposes of populating a "You Might Also Be Interested in..." box on the product pages.

The Hancock Compiler, available free for non-commercial use (I guess that kills my propsed idea, still this is a more academic excercise), is available for download from the AT&T Labs site I linked to above, is functionally a two pass compiler, first it converts any .hc file provided to it from Hancock code to C code, before sending everything (including any pure C files) to the C compiler for final compilation. This means that all of Hancock's special constructs; views, multi-unions, directories, maps, and pickles can potentially be used in normal C programs. The hcc compiler serves as a front-end to the C compiler, and any arguments which it doesn't recognize, will be passed to the C compiler. One interesting option is -C, which causes Hancock to dump the C representation of a .hc file to disk, so you can analyze the file for debugging purposes, or simply to learn how Hancock does what it does. If a person were to create a Open Source version of Hancock, this would be crucial to allowing that work to be done, without viewing the source code for Hancock directly. Of course, AT&T does have Patents on some of the technology used in Hancock, but like Data Mining Ethics, Software Patents will be a discussion for another day.

As Hancock has a very good manual provided by AT&T, I'm not going to write a full tutorial, but I will be posting code blocks in order to highlight some of what I really like about the language, for it's intended purpose. For that code, the following visitRec_t is the type defining the record for a CustomerVisits table, which contains rows detailing all the products a customer has gone to in their web browser, as well as the time they went. The stream declaration allows for the stream to be instatiated, and tells the IO library, Sfio, to map it's input to our new type. For simplicity, each product belongs to only a single category.

typedef struct {
    ip_t address;
    prod_id product;
    cat_id category;
    time_t viewed;
} visitRec_t;
stream visitDetail_s { getValidVisitRecord : Sfio_t => visitRec_t; };

The first main difference from C to Hancock is that a Hancock program begins with sig_main instead of main. In our example, the entry point is

void sig_main ( visitDetail_s visitsStream v <v:> )

I'm following the AT&T convention here, so the _s at the end of the type, indicates that the type is for a string. It's kind of a reverse Hungarian Notation. What's most interesting, is that the stream type, internally, is kind of like a linked list, and data will be iterated over in a similar manner. The coolest part is that the code for opening the file which stores the data stream is handled for you. That bit is shorthand to tell the program that the -v command line option will have the filename to connect the stream to as an argument. We'll get more into streams later, but the fact that all that command line processing is handled for you at the language level is huge.

The meat of the program will be the iterate block, which contains the instructions for how to process through the data. In some ways, this reminds me of LINQ from C# 3.0, but the techniques are different enough that that really ins't a totally fair comparison. The syntax is pretty clear. The following snippet of code is supposed to do something every time a customer visited a product in the BABY_CLOTHES category.

iterate (
    over visitsStream
    filteredby(v) (v->category == BABY_CLOTHES) ) {

    event (visitRec_t *v) {
        // Do something.  Any C function would be valid here.
    }
};

An iterate block can have multiple events of interest. In this case, there is a single, anonymous event, because the way the iterate statement is filtered (ie, by category == BABY_CLOTHES), the only records available to trigger an event are the one's we're interested in. In C's Switch-Case construct, the anonymous event is similar to the default clause. Other than that, the code is pretty straight-forward. over denotes the name of the stream, and filteredby gives instructions on how to limit the search. In C#'s LINQ, this would look something like this:

data = from visit in visitsStream
             where visit.category == BABY_CLOTHES
             select *

foreach datum in data {
    // Do Soemthing.
}

Very similar. Since Hancock has been around since 2000, and C# 3.0 is just on the verge of release, I wonder if LINQ was at least inspired by Hancock. Both serve similar goals, of making it easier to process relational data. If anything, the similarities between the technologies serve as a indicator of how potentially powerful and useful this sort of syntax is. In some ways, I think the LINQ syntax might be easier to follow. Partially because it's written closer to SQL (which is more familiar). Partially because it returns an IEnumberable type which you can do basically whatever with. With Hancock's iterate statement, once the data has been processed.

For the most part, such simple handling wouldn't be very useful on it's own. But what streams are to Input, maps are to Output. Maps are basically hash tables which allow you to define certain parameters. All maps would define the range of the keys, the type of the value, and a default value. Really interesting, the the split directive, which seems to allow you to tell the computer how large of blocks to load at one time. This is interesting in that it provides a lot of power in letting the developer decide how much system memory to devote to a process in order to find a level that minimizes disk accesses. The intricacies of how that works will have to wait until later, but the basics are pretty simple. If we're trying to determine how many visits a particular product has recieved, our map might look like this:

map productViewCnt_m {
    key (MINPRODUCTNUMBER .. MAXPRODUCTNUMBER);
    split (10000, 1000);
    value int;
    default 0;
}

The iterate function merely has a few new directives:

int view_cnt;

iterate
    (   over visitsStream
        sortedby product
        filteredby(v) (v->category == BABY_CLOTHES)
        withevents productDetect ) {

    event prod_begin(product_t p) {
        view_cnt = 0;
    }

    event product_viewed (visitRec_t *v) {
        ++view_cnt;
    }

    event prod_end(product_t p) {
        pvc<:p:> = view_cnt + pvc<:p:>; // Adds this view_cnt to the existing value.
    }
};

In this case, the productDetect method handles filtering to the event calls. prodbegin is called each time a new product number is encountered, productviewed is called for each view, and prodend is called just before the number changes. pvc is just an instatiated copy of productViewCntm. Incidentally, while withevents takes a function, so does filteredby. In fact, the syntax above for filteredby simply creates an anonymous function to do the filtering. Declaring a function to handle filtering is trivial. It needs to return an integer, takes a single argument which must match the type mapped to by the stream. The values it returns are just like any other boolean comparison in C.

There are a lot of more advanced features to make data access more manageable. Views, Multi-Unions, Directories, Pickles, etc. The Hancock manual is pretty good, and I plan to dig into this technology deeper. Hancock is an interesting domain-specific language, with some features that we're starting to see become more mainstream through LINQ in C# 3.0. While AT&T's use of this technology may be subject to moral questions, this level of advanced data analysis has a lot of potential use. From analying customer usage for billing purposes, to investigating internet traffic for the purposes of noting security trends, and examining customer interests to better serve their needs, Hancock appears well suited to analyzing enormous amounts of data with relatively light hardware requirements. I'll be blogging more about this domain-specific language, and am considering a project to write in Hancock, just to figure out how it really works.