September 2010 Archives

Sentiment Analysis

I learned about two awesome things this morning. First, Yahoo! Research is on Twitter. Second, there is a computer science conference celebrating women in computing named after Grace Hopper. The link from Yahoo! Research pointed to a paper from that conference on managing communities by automatically monitoring their moods, based on the premise that a community that tends toward sadness and anger tends to discourage participation. That said, this needs to be balanced against too much happiness, which, aside from simply being treacle, implies that debate and discussion are not welcome in the community.

The paper, entitled Anger Management: Using Sentiment Analysis to Manage Online Communities, presents the findings of Yahoo!’s Elizabeth Churchill and Pomona College’s Sara Owsley Sood, who analyzed comments left on a sample of news stories to determine whether each comment was relevant and what its overall tone was. The most interesting discussion, to me, centered on the differences in language used in different problem spaces: ‘cold’ is a positive word when talking about drinks, but negative when talking about people, for instance.

The research is, as most research is, built on other work. The relevance analysis is based on a 1988 paper that I need to read; the authors took the algorithm from that earlier paper and used the article text as the source to compare the comments against in order to generate a relevance score. I’m guessing the analysis is done in a Bayesian fashion, but what was really interesting was how this particular method of relevance analysis could be applied to comment filtering on a blog or something similar. Lately, I’ve been deleting a lot of spam comments from here that actually looked like they might have been non-spam, until I saw what post they were attached to.
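The paper doesn’t reproduce its scoring formula here, but the general shape of this kind of relevance check can be sketched as a cosine similarity between word-count vectors of the article and the comment. Everything below — the function names, the tokenization — is my own toy illustration, not the method from the paper:

```javascript
// Toy sketch of article/comment relevance: cosine similarity of
// word-count vectors. My own illustration, not the paper's algorithm.
function wordCounts(text) {
  const counts = {};
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
}

function relevance(article, comment) {
  const a = wordCounts(article);
  const b = wordCounts(comment);
  let dot = 0, normA = 0, normB = 0;
  for (const w in a) normA += a[w] * a[w];
  for (const w in b) {
    normB += b[w] * b[w];
    if (a[w]) dot += a[w] * b[w]; // only shared words contribute
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A comment that reuses the article’s vocabulary scores near 1; off-topic spam pasted under a random post scores near 0, which is exactly the filtering signal I’d want.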

The mood analysis was very interesting, though much of this paper was based on other papers that I have not read. They seemed to split the analysis into three categories: Happy, Sad, and Angry. Personally, I would like to see another dimension to the analysis, hostility, that would attempt to detect the difference between someone who is passionate about a subject (which can easily be mistaken for anger) and someone who has turned hostile. In my experience, the more dangerous thing in a community is not anger but hostility. Still, a near real-time analysis of mood based on text, which could flag a moderator, has some interesting uses. Again, I suspect this analysis is fundamentally Bayesian.

It may seem that I find this work derivative, since I mention several times that there are more papers I need to read. All academic writing leads to other sources that must be read to fully understand a topic, and this is also a short paper. It combines a few techniques into a potentially very useful automated system to aid moderators (probably not replace them, yet), and it shows this succinctly. It also raises more questions than it is able to answer at this time: the data coming out of this system could be used to help analyze and predict trends in a community before small problems become big ones. A short paper, yes, but one that may well serve as a pivot in moderation systems.

If there is a weakness to this system, it’s the same as in any Bayesian modeling system: it requires a fair amount of domain-specific seed data to be able to determine mood. If a group were to implement this today, they’d need to spend a substantial amount of time training the algorithm on their domain for it to be most useful. Hopefully, corpora can be built up around the various domains to train these sorts of algorithms more easily.
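To make the seed-data point concrete, here’s a minimal naive Bayes sketch showing how a handful of hand-labeled comments trains a Happy/Sad/Angry classifier. This is my own illustration of the general Bayesian approach, not the paper’s (surely more sophisticated) model:

```javascript
// Minimal naive Bayes mood classifier trained on hand-labeled "seed"
// comments. A toy showing why domain-specific training data matters;
// not the model from the paper.
function trainMoodClassifier(seed) {
  const counts = {}, totals = {}, docs = {};
  let docTotal = 0;
  const vocab = new Set();
  for (const { text, mood } of seed) {
    docs[mood] = (docs[mood] || 0) + 1;
    docTotal += 1;
    counts[mood] = counts[mood] || {};
    totals[mood] = totals[mood] || 0;
    for (const w of text.toLowerCase().match(/[a-z']+/g) || []) {
      counts[mood][w] = (counts[mood][w] || 0) + 1;
      totals[mood] += 1;
      vocab.add(w);
    }
  }
  return function classify(text) {
    let best = null, bestScore = -Infinity;
    for (const mood in docs) {
      // log P(mood) + sum of log P(word|mood), Laplace-smoothed
      let score = Math.log(docs[mood] / docTotal);
      for (const w of text.toLowerCase().match(/[a-z']+/g) || []) {
        const c = (counts[mood][w] || 0) + 1;
        score += Math.log(c / (totals[mood] + vocab.size));
      }
      if (score > bestScore) { bestScore = score; best = mood; }
    }
    return best;
  };
}
```

With only a few seed comments per mood this works on toy input, but a real community would need far more labeled data from its own domain — which is exactly the training burden described above.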

Government Involvement on the Internet

1 Comment

We’re right on the edge of the worst thing that could happen to the Internet, or at least America’s activity on it. First, this week, the Senate is considering in committee a bill named the Combating Online Infringement and Counterfeits Act. The idea behind this bill is to keep people from buying similar domain names (like a misspelled version of google.com, which is a bad example since Google already owns those) and using them either to drive traffic to their own sites or to set up phishing scams.

In a way, this has become even more important since May of this year, when ICANN began supporting internationalized domain names, since there are characters in Unicode that look very similar in most fonts but are technically different code points. This bill is a problem because it allows for the creation of a list, maintained by the Attorney General’s office, which… well, here’s the text:

(i) a service provider, as that term is defined in section 512(k)(1) of title 17, United States Code, or other operator of a domain name system server shall take reasonable steps that will prevent a domain name from resolving to that domain name’s Internet protocol address;

Basically, if the Attorney General blacklists a domain, it is now the responsibility of an ISP (or probably anyone who runs a DNS server) to ensure that these domains no longer resolve. Alright, so what criteria get a domain on this list?

  1. Something that appears to only be around to violate Title 17, or copyright law.
  2. A site designed to sell ‘counterfeit’ material, as defined in the Lanham Act.
  3. A site engaged in selling something that is subject to forfeiture according to US Code regarding stolen property.
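The lookalike-character problem from the internationalized-domain-name discussion above is trivial to demonstrate. This is a toy check of my own, not how registrars or browsers actually police homoglyphs:

```javascript
// Two domain names that render almost identically in most fonts but
// are entirely different code point sequences: Cyrillic 'а' (U+0430)
// standing in for Latin 'a' (U+0061).
const latin = "example.com";
const lookalike = "ex\u0430mple.com"; // Cyrillic small a in place of Latin a

// A naive comparison would need to inspect code points, since the
// rendered glyphs offer no visible difference to a user.
function codePoints(s) {
  return [...s].map(ch => ch.codePointAt(0));
}
```

To the eye these two strings are the same domain; to DNS they are completely unrelated names, which is what makes them so useful for phishing.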

Clearly the justification behind this bill is to protect American business from illegitimate and illegal competition, which is fine. The problem with the creation of this list is the precedent it sets. It’s government censorship (yes, of illegal activity, but that is beside the point). And once something like this is created, it becomes far more likely that its scope will expand, while the list itself would likely never go away.

Remember the 17th-century proverb: Hell is paved with good intentions. The government feels this is a necessary expansion of powers to enforce copyright and trademark law in the Internet age. That is not a bad goal, but if we sit back and let this happen, not only could this law be used to try to shut down sites like YouTube, but the list could be more easily expanded in the future, and that’s the real danger. I, for one, gladly signed the petition at Demand Progress, and encourage you to do the same.

Then, yesterday, stories were posted regarding the US Government wanting to modify wiretapping laws in a way that would force service providers to be able to decrypt communications and provide them to the government. The New York Times has a really good story, but they require registration now, so fuck them.

The argument presented by the FBI and others in law enforcement is that the Internet has dramatically reduced law enforcement’s ability to monitor communications. This is true. The Plain Old Telephone Service (POTS) was designed in a way that made tapping it trivial. In the earliest days, when calls were routed by a switchboard operator physically plugging in wires to connect two phones, many switchboard operators were notorious for listening in on calls. When the switchboards went mechanical, tapping was still trivial, because all you needed to do was go to the central office and tap into the leads attached to the line you were interested in.

By the time we went digital, the legal precedent for wiretapping telephony was well established, and the digital routing systems, in the interest of performance, didn’t use encryption, and later they used a symmetric encryption method that the phone company could decrypt but others couldn’t. Cell phone networks were designed by the same people who did the telephone networks, so they were also designed in a way that made tapping them fairly trivial, for the network operator.

These systems have another weakness in addition to their ease of tapping: they are easy to disrupt, because of their centralized nature. Cell phones operate in well-established frequency ranges (for good reason), which makes them simple to disrupt. GSM has an insecure default failure mode.

The Internet was designed to avoid these problems. It was designed to route around damage in the network, in order to allow communication to continue. And it works, sort of: many communities may have only one trunk entering them, but even if such a community gets taken down, the number of people affected by the outage is relatively small.

This design philosophy of the Internet has contributed to the design of other peer-to-peer (P2P) mechanisms, like BitTorrent and Skype, where a central server is not necessary, and data can route between members of the network in order to reach its destination. The other piece necessary to make this work is encryption, so that the random masses hosting the nodes in the mesh network can’t spy on the traffic passing through. Some P2P topologies can even mask who the sender and receiver are, to an extent.

If this legislation (which has not been introduced yet) were to be passed into law, service providers, including companies like Skype, would be legally required to insert a backdoor into their services that would allow them to decrypt user data and get access to information that they currently cannot. This would allow them to turn that information over to the government when subpoenaed.

Good encryption is designed in such a way that the only people who can decrypt the data are the people who know the secret that was used to encrypt it. It comes in two kinds: slow public-key mechanisms, where there is a public ‘secret’ used to send messages that can only be decrypted with a private secret, and symmetric mechanisms, where there is a shared secret. Most real-time cryptosystems use public-key cryptography to authenticate each other, then share a key to be used for faster symmetric operations. Without the private key or the shared key, a good encryption scheme should be impossible to break short of brute force. Which is why the NSA and NIST got so much bad press in 2007, when it was shown that a new random number generator they had standardized, Dual_EC_DRBG, could have a weakness whereby whoever chose its constants could predict its output, making it possible to decrypt any message whose keys were generated using the constants in the reference implementation (and most implementations probably would have just taken the NSA’s constants).
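The “authenticate with public-key math, then agree on a shared key for fast symmetric work” pattern can be illustrated with a toy Diffie-Hellman-style exchange. This is my own sketch with numbers far too small to be secure — it shows the shape of the exchange, not a real protocol:

```javascript
// Toy Diffie-Hellman-style key agreement using BigInt. Real systems
// use enormous primes or elliptic curves; this only illustrates how
// two parties derive the same shared secret over a public channel.
function modPow(base, exp, mod) {
  // square-and-multiply modular exponentiation
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

const p = 18446744073709551557n; // 2^64 - 59, the largest prime below 2^64 (still toy-sized)
const g = 5n;

// each side picks a private exponent and publishes g^x mod p
const alicePriv = 123456789n, bobPriv = 987654321n;
const alicePub = modPow(g, alicePriv, p);
const bobPub = modPow(g, bobPriv, p);

// both sides derive the same shared secret, which would then key a
// fast symmetric cipher for the rest of the session
const aliceShared = modPow(bobPub, alicePriv, p);
const bobShared = modPow(alicePub, bobPriv, p);
```

An eavesdropper sees only the public values; with real-sized numbers, recovering the shared secret from them is infeasible. A government backdoor amounts to a third copy of the secret, or a master value that can regenerate it, and that is precisely the weakness discussed below.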

It is entirely likely the NSA did not do this on purpose, and I’ll leave it up to you to decide how likely a conspiracy is.

Now, if the government gets its way, every encryption mechanism used, at least by US-based services, and perhaps by any encrypted communication system used by Americans, would need to have a backdoor like the one in Dual_EC_DRBG. However, the ‘secret’ keys needed to unlock the encrypted communication are an enormous weakness in the cryptosystem. Consider HDCP, the encryption standard used to protect the output of Blu-ray players and set-top boxes: its master key was leaked recently, meaning that the entire cryptosystem has been compromised. This has happened before. Many times.

But, going back to Skype. Let’s say Skype does implement an encryption mechanism that operates like this. Now, if the master key that Skype can use to decrypt any communication were ever to be leaked, then anyone running a Skype node would be able to decrypt all the communication running through their node.

Valerie E. Caproni, General Counsel of the FBI, was quoted in the New York Times saying that companies “could still advertise having strong encryption” since they’d only have to decrypt and turn over the data in the event of a subpoena. Ms. Caproni completely misses the point. The moment the ability for a third party, any third party, to read a message is introduced into a cryptosystem, it is automatically insecure.

At the end of the day, I’d rather terrorists and criminals have access to secure means of communications than give up my own access. Yes, law enforcement’s job is getting harder, but any tool that shifts liberty from the government to the people is going to do that, and any tool that can accomplish that is absolutely worth having around.

Legalize Food Production, Protect Migrant Farm Workers

Much of the food grown in this country is grown by workers with no legal right to be here.

This is something that we, as a nation, should be intensely ashamed of. Not because they’re here, but because they generally lack any sort of legal protection, since any abuse they may receive on the job is impossible for them to report without risking deportation. From a less humanitarian perspective, there are communities throughout the country where these migrant workers and their families can add additional load to the services of the region, without necessarily improving the funding of those services. This tends to be more of an issue for education, which is generally funded primarily by property taxes, and there is no such thing as a migrant homeowner.

Fact is, it really doesn’t matter whether or not you think that migrant farm workers should be granted amnesty. They need it. There are too many of them. And most Americans won’t work these jobs anyway, as evidenced by the United Farm Workers’ Take Our Jobs campaign, which has had fewer than two dozen Americans even try their hands at working on a farm.

Now, I don’t blame people for not taking part in this campaign. I certainly haven’t, and my experience with helping Catherine maintain our garden plot makes me pretty certain that I don’t want to be a professional farm worker. Small scale hobby farm stuff? Yeah, but I’m never going to be a commercial producer.

There has been one prominent American who took part in this challenge. Stephen Colbert, of Comedy Central’s The Colbert Report, took up this challenge, and presented his two part “Fallback Position” segment where he worked as a migrant farm worker.

[Video: The Colbert Report — “Fallback Position - Migrant Worker - Zoe Lofgren”]

[Video: The Colbert Report — “Fallback Position - Migrant Worker Pt. 2”]

Okay, yes, Colbert is a clown. Yes, he spends a lot of time making jokes. But that’s his job, he’s on Comedy Central. But more so, he’s demonstrating that these jobs are hard, backbreaking experiences, and the people working them are very hard working people, regardless of their legal status.

However, what really impressed me was the Congressional testimony he references in the second video, presented here by C-SPAN.

Ultimately, we need these people. They work useful, necessary jobs that most Americans don’t want. Should immigration reform protect Americans’ options to work these jobs? I think so. And it’s entirely possible that, with the higher wages that legal protection is likely to bring, more Americans might consider them.

If there is a downside to this, it’s that food prices are liable to rise. But the human cost borne by migrant farm workers is one that, I think, is higher than the potential hit to my pocketbook.

Nature Twice: A Biologic Poetry Exhibit

At Washington State University’s Conner Museum last week, they opened a new exhibit, Nature Twice, which is a collaboration between WSU’s School of Biological Sciences and Department of English.

The exhibit features 40 poems about nature (a very small collection, obviously), chosen because the Conner Museum holds specimens of what each poem is about. To support the exhibit, the Museum put together a very nice booklet featuring interpretations of each poem by a graduate student poet, as well as biologic interpretations from a biology graduate student.

For the opening of the exhibit, there were readings from two local poets: Linda Russo, a WSU poetry professor, and Ray Hanby, who earned his poetry master’s at WSU with a thesis written as a sonnet cycle on the salmon life cycle. Both read several of their poems, including one each that they were debuting for the occasion.

After the poetry reading, we were free to walk around the exhibit, viewing the poems and the exhibits. The Conner Museum is one of the largest collections of northwest animals anywhere, and while the exhibits may seem a bit dated, the specimens, which include hundreds of birds, but also aardvarks, kangaroos, bison, and, of course, a full-sized cougar, are immaculately maintained.

The Nature Twice exhibit will be available for the next several months, so I encourage you to visit and read the poems while making your way through the museum. Even if you miss this exhibit, the museum is worth a walkthrough whenever you visit the WSU campus. It’s free (though donations are welcome), and you’ll see some really cool specimens.

Book Review: High Performance JavaScript

I’ve been reading the writings of performance gurus like High Performance JavaScript’s Nicholas Zakas and Steve Souders for several years. It was partly their writings, and the fact that Nicholas has worked on core parts of YUI, that drew me to that library during its 2.x series and have made me exceedingly pleased with the 3.x series. I’ve also read Souders’ book, High Performance Web Sites, which I found pretty fantastic as well.

High Performance JavaScript is, as the name implies, a focused book, specifically covering JavaScript, but it goes into a delicious level of detail on the topics that it covers, from the best way to load code, to accessing data, to string manipulation. These chapters are filled with hard data detailing what methods work best in which browsers, which is a huge deal when deciding the best overall solution based on your userbase.

Plus, the book sets the coming changes in Web standards alongside the older information. For instance, the bulk of the chapter on Responsive Interfaces covers techniques you can use to break up tasks in order to keep the UI thread free to respond to the user. But there are still a few pages devoted to the Web Workers API that’s showing up in browsers, which stands to revolutionize background threading.
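The general shape of the task-splitting technique is worth sketching. This is my own minimal version of the pattern, not code from the book — batch up a little work, then yield with setTimeout so the UI thread can respond before the next batch:

```javascript
// Process a large array in small batches, yielding between batches
// with setTimeout so the UI thread stays responsive. A minimal
// version of the general technique, not code from the book.
function processInChunks(items, processItem, done, chunkSize) {
  const queue = items.slice(); // work on a copy, not the caller's array
  function step() {
    let count = Math.min(chunkSize || 50, queue.length);
    while (count--) {
      processItem(queue.shift());
    }
    if (queue.length) {
      setTimeout(step, 0); // yield so pending UI events can run
    } else if (done) {
      done();
    }
  }
  step();
}
```

Web Workers make this kind of manual time-slicing unnecessary by moving the work off the UI thread entirely, which is why they’re such a big deal.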

Nicholas also reaches out to others on the topics he feels they’re better qualified to address. For instance, the Strings and Regular Expressions chapter was contributed by Steve Levithan. Actually, this chapter kicks ass regardless of what language you’re using: it’s great advice on how to avoid backtracking and write bail-early regexes that can save you a ton of execution time, though this being a JavaScript book, there is plenty of low-level, nitty-gritty JavaScript detail to keep in mind as well. I have Steve’s Regular Expressions Cookbook, and after reading this chapter, I’m thinking it would be an awesome book to just read. But then, I love regex.
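A quick example of the kind of backtracking-avoidance the chapter is about — this is my own illustration in the spirit of the “unrolling the loop” style, not code from the book. Both patterns match a double-quoted string that may contain escaped quotes, but the second has far fewer overlapping paths for the engine to backtrack through when a match fails:

```javascript
// Matching a double-quoted string that may contain escaped quotes.
// The first pattern runs an alternation inside a loop; the second
// "unrolls" it so each piece can fail fast without the engine trying
// many equivalent ways to carve up the same characters.
const lazyQuoted = /"(?:[^"\\]|\\.)*"/;        // alternation in a loop
const unrolledQuoted = /"[^"\\]*(?:\\.[^"\\]*)*"/; // unrolled loop
```

Both patterns match the same strings; the payoff of the unrolled form is on near-miss input, where the naive form can spend enormous time exploring equivalent backtracking paths before giving up.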

A fair amount of the material in the book was stuff I already knew. Still, the content, the deep dives into why something is the way it is, and the metrics backing up the assertions are fantastic, and even if you’ve done a lot of JavaScript over the years, there is probably something to walk away with. The fact that nearly half the book was written by contributing authors shows just how complex and nuanced many of the topics are, and that collaboration has made the book a lot stronger.

This is not a book for beginners, but anyone else doing JavaScript is bound to get something useful out of it, or something they’ll be able to refer to later. I’m glad to have this book on my digital bookshelf, and I’m sure you will be as well.

More ANOVA Data in R

In Catherine’s phylogenetic research, she has recently needed to do some ANOVA analysis on a data set for her current project. Luckily, R’s stats module has good support for this analysis via its cor function. However, the cor function only returns the correlation matrix.

There is other information generated by this analysis that is relevant to the work Catherine is doing, and it is returned by the cor.test function. She was mostly interested in the p-value, but cor.test also makes available a few other data fields.

To meet Catherine’s immediate need, I wrote the following function, which returns a list of matrices of results from cor.test: the p-values first, then the t-statistics, then the parameter (the degrees of freedom), and finally the correlation estimates.

corValues <- function(x) {
    if (!is.matrix(x))
        x <- as.matrix(x)

    size <- ncol(x)
    p <- matrix(nrow=size, ncol=size)
    t <- matrix(nrow=size, ncol=size)
    df <- matrix(nrow=size, ncol=size)
    cor <- matrix(nrow=size, ncol=size)

    i <- 1
    while (i <= size) {
        j <- i
        while (j <= size) {
            rv <- cor.test(x[,i], x[,j])
            # each result matrix is symmetric, so fill both halves
            t[i,j] <- t[j,i] <- rv$statistic
            df[i,j] <- df[j,i] <- rv$parameter
            p[i,j] <- p[j,i] <- rv$p.value
            cor[i,j] <- cor[j,i] <- rv$estimate
            j <- j + 1
        }
        i <- i + 1
    }
    list(p, t, df, cor)
}

It’s noticeably slower than the cor implementation, but it works fast enough. Mainly, I’d like to see this cleaned up to the point that it can at least take arguments the way the built-in functions do, but if you’ve got a matrix of data you want more than just correlation values for, the above works fairly well.

Palouse Code Camp Approaches

We’re just under two weeks from the Code Camp event I’ve been helping organize, Palouse Code Camp. Currently, I’m planning to give two presentations: the Introduction to YUI3 talk I gave in Portland, updated for 3.2 and SimpleYUI, and an Advanced YUI3 session covering module creation, Y.Base, and the Plugin system.

If you’re planning to attend in Pullman, WA on September 18, go to our website and sign in via OpenID; it will automatically register you for the event. If you then want to give a presentation, you can simply submit it. Anything development-related is fair game.

The website went up late last week, so we’ve got a limited number of sessions right now, but we hope to see you there, and we’re sure we’ll have plenty of exciting things coming soon.

How To Be Internet Famous

I know that at South by Southwest Interactive this year, Brian Brushwood is hoping to give a talk on how to Cheat, Scam, & Swindle Your Way to Internet Fame. However, I think I’ve managed to break it down into a few pretty simple steps.

First off, you don’t need to be extremely prolific. Your YouTube channel probably doesn’t even need to have anything on it yet. You just need to post one truly awesome video.

Now, it helps if you’re an attractive young woman whose video is about a subject close to any geek’s heart, but anything truly awesome will work. With even laptops these days more than capable of editing video, and reasonable-quality HD video cameras available for under $1,000, the barriers to entry are amazingly low compared to where they were even five years ago.

Once you’ve made and published your awesome video, with the Internet being what it is, you’re going to get at least some views almost immediately. Now, this is where the trick is. The video needs to get in the hands of the right people who recognize it for the awesomeness it is. Sometimes, as in the video above, the awesomeness is impossible to ignore. Other times, videos can languish for years before attracting mass attention.

Identifying the ‘right people’ is nearly impossible, so it’s usually not worth the bother. And if you’re not already Internet Famous, your ability to pimp your own goods is diminished; however, it never hurts to spread the link around on Twitter, Facebook, or whatever.

However, you won’t get really big until someone with resources notices you. For Brushwood, this was Revision3. For Rachel Bloom, from the video above, it was Penn Jillette. Penn’s talking about it happened just prior to a fairly large increase in views, according to the YouTube stats. A large number of the comments on the video reference Penn’s mention, though most of the more recent ones are…colorful.

Now, Penn Jillette is a legitimate celebrity, no doubt. But he also has over a million Twitter followers, so when he posts a link, people see it. Not to mention all the people who see the retweets of those links. The numbers add up really quickly.

But that’s the game. The only way to become Internet Famous is to produce awesome content. Be persistent. And have just a little bit of luck.