Sentiment Analysis

I learned about two awesome things this morning. First, Yahoo! Research is on Twitter. Second, there is a computer science conference celebrating women in computing, named after Grace Hopper. The link from Yahoo! Research pointed to a paper from that conference on the subject of managing communities by automatically monitoring their moods, based on the premise that a community that tends toward being very sad or angry tends to discourage participation. This does need to be balanced against too much happiness, though, which aside from simply being treacly, implies that debate and discussion are not welcome in the community.

The paper, entitled Anger Management: Using Sentiment Analysis to Manage Online Communities, presents the findings of Yahoo’s Elizabeth Churchill and Pomona College’s Sara Owsley Sood, who analyzed comments left on a sample of news stories to determine whether each comment was relevant and what its overall tone was. The most interesting discussion, to me, centered on the differences in language used in different problem spaces. ‘Cold’ is a positive word when talking about drinks, but negative when talking about people, for instance.

The research is, as most research is, based on other work. The relevance section builds on a 1988 paper that I need to read: they took the algorithm from that paper and used the article text as the source to compare each comment against, generating a relevance score. I’m guessing the analysis is done in a Bayesian fashion, but what I found really interesting was how this method of relevance analysis could be applied to comment filtering on a blog or something similar. Lately, I’ve been deleting a lot of spam comments from here that looked like they might have been legitimate until I saw which post they were attached to.
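To make the idea of scoring a comment against its article concrete, here is a minimal sketch. I have not read the 1988 paper, so this is not its algorithm: I am swapping in plain bag-of-words cosine similarity as a stand-in, and the function names (tokenize, relevance) and the sample texts are my own inventions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relevance(article, comment):
    """Cosine similarity between the word counts of an article and a comment.

    Assumption: a stand-in for the paper's relevance measure, not a copy of it.
    Returns a value between 0.0 (no shared words) and 1.0 (same word mix).
    """
    a = Counter(tokenize(article))
    c = Counter(tokenize(comment))
    dot = sum(a[word] * count for word, count in c.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0


# Hypothetical article and comments, made up for this example.
article = "The city council voted to expand the bike lanes downtown next year"
on_topic = "Glad the council is finally adding bike lanes downtown"
off_topic = "Cheap watches and handbags available at my website"

print(relevance(article, on_topic))
print(relevance(article, off_topic))
```

A spam comment that reads fine on its own but shares no vocabulary with the post scores at or near zero, which is exactly the case a blog filter would want to catch. A real system would at least drop common words like "the" and weight rare words more heavily.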

The mood analysis was very interesting, though much of this part of the paper builds on other papers that I have not read. They seemed to split the analysis into three categories: Happy, Sad, and Angry. Personally, I would like to see another dimension to the analysis, Hostility, that would attempt to detect the difference between someone who is passionate about a subject (which can easily be mistaken for anger) and someone who has gone hostile. In my experience, the more dangerous thing in a community is not ‘anger’ but hostility. Still, being able to do a near real-time analysis of mood based on text, which could potentially alert a moderator, has some interesting uses. Again, I suspect this analysis is fundamentally Bayesian.
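Since I keep guessing that the analysis is Bayesian, here is roughly what I have in mind: a tiny naive Bayes classifier over the three moods. This is my own sketch of the general technique, not the paper's implementation, and the class name, the training sentences, and the labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class MoodClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # mood -> word -> count
        self.doc_counts = Counter()              # mood -> training examples
        self.vocab = set()

    def train(self, text, mood):
        tokens = tokenize(text)
        self.word_counts[mood].update(tokens)
        self.doc_counts[mood] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        """Return the mood with the highest log-probability, or None if untrained."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_mood, best_score = None, -math.inf
        for mood in self.doc_counts:
            # Log prior: how common this mood was in training.
            score = math.log(self.doc_counts[mood] / total_docs)
            total_words = sum(self.word_counts[mood].values())
            for token in tokens:
                count = self.word_counts[mood][token]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_mood, best_score = mood, score
        return best_mood


# Hypothetical seed data; a real system would need far more per mood.
clf = MoodClassifier()
clf.train("what a wonderful thoughtful article thank you", "happy")
clf.train("this is heartbreaking news i am so sorry for the family", "sad")
clf.train("this is outrageous the author is an idiot", "angry")

print(clf.classify("thank you for this wonderful piece"))
print(clf.classify("the author is an outrageous idiot"))
```

Adding my hypothetical Hostility category would just mean training with a fourth label. The hard part would be finding examples that separate hostile from merely passionate.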

It may seem that I find this work derivative, since I mention several times that there are more papers I need to read. All academic writing generally leads to other sources that must be read to fully understand a topic, and this is also a short paper. It combines a few techniques to reveal a potentially very useful automated system to aid moderators (probably not replace them, yet), and it does so succinctly. It also raises a lot more questions than it’s able to answer at this time, in that the data coming out of this system could be used to help analyze and predict trends in a community before small problems become big ones. A short paper, yes, but one that may well serve as a pivot point in moderation systems.

If there is a weakness to this system, it’s the same as that of any Bayesian modeling system: it requires a fair amount of domain-specific seed data to be able to determine mood. If a group were to implement this today, they’d need to spend a substantial amount of time training the algorithm for their domain for it to be most useful. Hopefully, corpora can be built up around the various domains to train these sorts of algorithms more easily.
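The ‘cold’ example from the paper shows why the seed data has to be domain-specific, and a toy version is easy to sketch. Everything here is an assumption of mine for illustration: the SEED snippets, the domain names, and the simple majority-count rule are not from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical labeled snippets: (domain, text, label).
SEED = [
    ("drinks", "the beer was cold and refreshing", "positive"),
    ("drinks", "served warm and flat", "negative"),
    ("people", "the host was cold and dismissive", "negative"),
    ("people", "she was warm and welcoming", "positive"),
]


def learn_polarity(seed):
    """Count, per domain, how often each word appears under each label."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for domain, text, label in seed:
        for word in set(text.lower().split()):
            counts[domain][word][label] += 1
    return counts


def polarity(counts, domain, word):
    """Majority label for a word within one domain; ties and unknowns are neutral."""
    seen = counts[domain][word]
    pos, neg = seen["positive"], seen["negative"]
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"


counts = learn_polarity(SEED)
print(polarity(counts, "drinks", "cold"))
print(polarity(counts, "people", "cold"))
```

The same word lands on opposite sides depending on which domain's data it was learned from, so a model trained on one community's comments cannot simply be dropped onto another's.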