New CAPTCHA Technology?

Posted on April 24th, 2008

CAPTCHA, the Completely Automated Public Turing test to tell Computers and Humans Apart, is supposed to be a method of determining whether an entity connecting to a service (typically via the web) is a human or a computer. Traditionally, this has been accomplished by generating ‘words’ and rendering them as either an image or an audio file, such that a person could easily recreate the intended text but a program would struggle.

CAPTCHA began simply, with nothing to obscure the data, and those early versions were quickly beaten by existing OCR technologies. Some sites, like Coding Horror, still use these “naive” CAPTCHAs, since what they are protecting simply isn’t important. As time went on and CAPTCHA came to be used on more and more sites, including financial and e-commerce sites, different techniques for adding ‘noise’ to the image were developed. However, as with all things, money called, and people figured out how to heuristically break almost any CAPTCHA. TicketMaster works hard to keep people from buying too many tickets (in an effort to curtail scalping), but there is so much money to be made reselling concert and show tickets that TicketMaster’s CAPTCHA only stops the most amateurish scalpers.

Even Google, whose CAPTCHA was considered one of the strongest, was broken recently, causing hordes of fake Gmail accounts to be created and leading some people to blacklist mail originating from Gmail as spam. The amount of money to be made by breaking CAPTCHA seems to far exceed the amount to be saved by doing it right in the first place. All anyone seems to know now is that CAPTCHA as it exists today is virtually pointless.

For a while, I was trying to set up reCAPTCHA support on this blog to cut down on spam comments (right now, any unauthenticated commenters need to be approved by me, and the filters let a handful of spam comments a day through into this holding pen). I didn’t do this because I thought reCAPTCHA was unbreakable; I know that it isn’t. The idea was that presenting a slightly less open target would hopefully keep people away. Unfortunately, I had trouble with the Movable Type plugin that was available, and never really took the time to correct the errors. Still, reCAPTCHA was a cool idea: a CAPTCHA whose answers help solve a difficult OCR problem (which, incidentally, would be a great method for breaking other CAPTCHAs). It was simple, but it did enough for my basic purposes.

Ultimately, any site that depends on real security from its CAPTCHA needs something far better than what currently exists, and there are several inviting technologies in this space. One of my student loan providers uses a login CAPTCHA where I had to select an image from several dozen images and enter a word that I could remember in relation to that image. Then, at login, they show me four images and ask me to select my image and enter my chosen word. It isn’t safe from shoulder-surfing, but nothing really is, and it is far less likely to be guessed by a remote attacker or bot.
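To make the image-plus-word login concrete, here is a rough Python sketch of how such a check might work. I have no idea how my loan provider actually implements it, so everything here is an assumption for illustration: the catalogue of image IDs, the function names (enroll, build_challenge, verify), and the choice of three decoys to make up the four images shown.

```python
import hmac
import random

# Hypothetical catalogue of image IDs (the "several dozen images").
IMAGE_CATALOGUE = [f"img_{i:02d}" for i in range(36)]


def enroll(image_id, word):
    """Record the image and word a user picks at sign-up (illustrative only)."""
    if image_id not in IMAGE_CATALOGUE:
        raise ValueError("unknown image")
    # Assumption: a real system would store a salted hash, not the raw word.
    return {"image_id": image_id, "word": word.strip().lower()}


def build_challenge(profile, rng, decoys=3):
    """Return the user's image mixed with decoy images, in random order."""
    others = [i for i in IMAGE_CATALOGUE if i != profile["image_id"]]
    shown = rng.sample(others, decoys) + [profile["image_id"]]
    rng.shuffle(shown)
    return shown


def verify(profile, chosen_image, typed_word):
    """Both the selected image and the typed word must match the profile."""
    image_ok = hmac.compare_digest(
        chosen_image.encode("utf-8"), profile["image_id"].encode("utf-8")
    )
    word_ok = hmac.compare_digest(
        typed_word.strip().lower().encode("utf-8"),
        profile["word"].encode("utf-8"),
    )
    return image_ok and word_ok
```

With four images on screen, a bot picking at random gets the image right one time in four, but it still has to produce the word, which is where most of the protection in this sketch comes from.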

Most interesting in the world of CAPTCHA research is the work being done at Penn State University by Professor James Wang, entitled IMAGINATION. It’s a two-stage CAPTCHA focusing on images. The first challenge requires you to click on the center of any photo in a collage of photos that might overlap; the second asks you to identify a fairly complex image by choosing a word from a list.
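The two stages can be sketched in a few lines of Python. This is not the IMAGINATION code, just my guess at the shape of the check: the Tile rectangle, the 10-pixel tolerance, and the function names are all assumptions, and the real system presumably scores clicks and labels in its own way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical axis-aligned rectangle for one photo in the collage."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def click_hits_center(tiles, click, tolerance=10.0):
    """Stage 1: accept a click within `tolerance` pixels of any tile's center."""
    cx, cy = click
    for tile in tiles:
        tx, ty = tile.center
        if (cx - tx) ** 2 + (cy - ty) ** 2 <= tolerance ** 2:
            return True
    return False


def label_matches(expected, chosen):
    """Stage 2: the word chosen from the list must be the image's label."""
    return chosen.strip().lower() == expected.strip().lower()


def passes(tiles, click, expected_label, chosen_label):
    """The user passes only if both stages succeed."""
    return click_hits_center(tiles, click) and label_matches(
        expected_label, chosen_label
    )
```

Tiles are allowed to overlap here, as in the collage described above; the check only looks at each photo's geometric center, so a bot first has to work out where one photo ends and the next begins.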

These tasks are pretty trivial, but there is enough noise on the images, along with heavily distorted colors, to make identification harder. Is it perfect? No, it will someday be broken, but it’s a strong step forward. Interestingly enough, many attempts to crack CAPTCHA implement some fairly advanced machine learning algorithms, and that will become far more important as CAPTCHA evolves. Perhaps the same element of society that profits from breaking CAPTCHA may become the leaders in AI research in the future?