Technology Case Studies
Spam driving innovation in artificial intelligence
Captcha technology in the lead
Posted 03-29-2009
By now, they’re familiar to anyone who has been around the Web a few times, showing up wherever comments or votes are solicited or visitors are asked to sign in or register. They appear as small boxes containing one or two words, or a string of random characters, whose shapes look melted, perhaps, or fractured or torn. In some, the letters overlap, and often a line or two slices through the words while the background teems with random squiggles and patches of color.
These blurbs of malformed text, which are the simplest form of something called a Captcha, have become the focus of a high-stakes, high-tech battle, one that perhaps neither side will ever win conclusively but that is spurring advances in that most challenging corner of computer science, artificial intelligence (AI).
Here’s why: Anyone who types the warped words correctly is tagged as a human being of more or less normal intelligence and allowed to proceed. Anyone who gets the words wrong a couple of times, however, is recognized by the website’s computer as one of its own: another computer, dumb as a rock, and presumably a malicious machine bent on splattering as many pages as it can with ads for this and that vice and questionable product.
Evidently, computers can be programmed to generate visual puzzles that are relatively easy for people to solve but that stump other computers. But only temporarily. Spammers have shown that, sooner or later, they can figure out how to jump over these digital turnstiles. By applying clever pattern-recognition algorithms, the bad hats can, in a tiny fraction of a second, decode many, if not most, of the squiggly texts put in front of them, freeing them to pepper blogs with ads, set up bogus email accounts en masse, rig online votes, and generally wreak havoc.
Even if the troublemakers can’t crack every such puzzle they encounter, a success rate of 90 percent, 75 percent, even 50 percent may be enough for their mass impersonation efforts to yield profits. Hence the ever-escalating Captcha-centered arms race. As researchers at Yahoo, Carnegie Mellon University, IBM, and elsewhere scramble to create ever-tougher Captchas, spammers are engaging in their own counter-research. Just as has happened in other areas of computer security (firewalls, cryptography, and user authentication, for instance), every release of a new, more challenging form of Captcha seems to spur spammers into figuring out a solution, which in turn prompts legitimate researchers to seek further enhancements.
Research goes beyond defeating spammers
Besides improving security for websites, this ongoing cat-and-mouse game is yielding significant contributions to AI. The computer scientists at Carnegie Mellon and IBM who came up with the idea of Captchas and have promoted their public use noted in a widely read 2000 research paper that “either the Captcha is not broken and there is a way to differentiate humans from computers, or the Captcha is broken and a useful AI problem is solved.” Either way, progress will be made.
The specific AI problem involved is pattern recognition, a subtopic of machine learning. Getting computers to identify patterns and objects autonomously has proven useful in everything from manufacturing to farming to national defense. In a factory, for instance, computers may scan parts moving along a conveyor belt to make sure each one is oriented properly and ready to be worked on by the next robotic tool. By analyzing live video imagery from highway cameras, computers are able to predict the formation of traffic jams. Intelligence agencies rely on sophisticated pattern-recognition techniques to sort and interpret the flood of images they receive from spy satellites.
Even everyday consumer cameras are now able to recognize human faces and, in some models, the moment when a subject smiles.
Captcha research considers the problem of getting computers to recognize patterns that have been purposely mangled, and thus it is helping to advance cryptography. Typical of most cryptographic systems is a secret key, or password, that determines how a certain set of data is to be distorted, or transformed, and thus made unreadable. Depending on how many bits the key contains, an attacker who doesn’t know it may have to spend hours, days, or even decades on high-speed computers to figure out the key and decode the information, by which time, if the cryptosystem has been designed properly, the information will be worthless. Thus, the goal is to make encrypting data easy while making decryption without the key difficult enough to deter attackers. Today’s most robust coding systems achieve this goal by employing what mathematicians call a one-way function, typically relying on the enormous difficulty of finding the prime factors of very large numbers.
The Captcha relies on another type of one-way function, rooted in what’s called a “hard AI” problem. The spammer’s computer “knows” what it’s looking for: any of 26 alphabetic characters plus the numerals 0 through 9, presented in random combinations. But by applying random and therefore unpredictable visual distortions to those characters, the Captcha program greatly increases the difficulty of computer recognition without hindering people’s ability to understand what they’re seeing. Again, given enough time, the spammer’s computer will eventually recognize even heavily distorted characters.
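The lopsidedness of a one-way function, as described above, can be seen in a toy sketch. This is only an illustration under loud assumptions: the two primes are tiny and chosen for convenience, trial division is the most naive factoring method there is, and the function names are hypothetical. Real cryptosystems use numbers hundreds of digits long and attackers use far better algorithms, but the imbalance between the two directions is the same in kind.

```python
def multiply(p: int, q: int) -> int:
    """The easy direction: a single multiplication."""
    return p * q


def factor_by_trial_division(n: int) -> tuple[int, int, int]:
    """The hard direction: hunt for the smallest factor of n.

    Returns (smaller factor, larger factor, number of divisions tried).
    """
    if n % 2 == 0:
        return 2, n // 2, 1
    steps = 1  # counts the division by 2 above
    candidate = 3
    while candidate * candidate <= n:
        steps += 1
        if n % candidate == 0:
            return candidate, n // candidate, steps
        candidate += 2  # even candidates were ruled out already
    return 1, n, steps  # no factor found: n is prime


# Two small primes, picked only for illustration (an assumption).
P, Q = 10007, 10009
N = multiply(P, Q)
small, large, steps = factor_by_trial_division(N)
print(f"{P} x {Q} = {N} (one multiplication)")
print(f"recovered {small} x {large} after {steps} trial divisions")
```

Going forward takes one operation; going backward takes thousands even at this toy size, and the gap widens rapidly as the numbers grow. A Captcha aims for a similar imbalance, with image distortion in the easy direction and machine recognition in the hard one.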
Spammers, however, rely on posting their messages in large volumes, which makes it relatively easy to stymie them: If their AI software is able to decode Captchas with, say, a 25 percent success rate, simply requiring a site’s visitors to solve two Captchas in a row reduces that rate to a discouraging 6.25 percent (0.25 x 0.25 = 0.0625), assuming the two attempts succeed or fail independently.
Carrying the work into other fields
The work on cracking Captchas and developing better ones in response may lead to AI algorithms that apply beyond blocking spammers. It might, for instance, help with steganography, the science of hiding information or messages in plain view. It is possible to hide messages in digital photograph files, for example, in such a way that no person can detect the message by viewing the photo or even by inspecting the file’s complete data. The message is hidden, in essence, because it is sprinkled across the underlying data with no apparent pattern. But deeper knowledge about automatic pattern recognition might be useful both in finding those patterns and in improving steganographic techniques.
Similarly, better pattern-recognition methods could help with protecting copyrighted materials. As the original Captcha authors write, “a program that can find slightly distorted versions of original songs or images on the world wide web would be a very useful tool for copyright owners.”
Ultimately, though, those researchers offered their analysis of the Captcha problem as a way to point out the “symbiosis” between AI and cryptography, and they called on security researchers to create new Captchas, based on different AI problems, and to release those creations for testing on the wild, wild Web. And many have taken up the call:
• Microsoft developed a Captcha that involves distinguishing between cats and dogs in a set of randomly selected photographs.
• CMU’s own Captcha group came up with SQUIGL-PIX, which requires visitors to trace with their cursors the outlines of certain figures in a set of photos: all statues, for instance, or all rafts. To make this difficult for computers, the photos don’t necessarily show a complete view of the item requested. And to keep computers from simply guessing, each trace has to be fairly accurate.
• Another CMU Captcha presents a set of images and asks visitors to select from a 50-item list the one word that relates to all of the images. So, for photos of a mouse, guinea pig, mole, and gerbil, the correct word is “rat.”
Enhancing the readability of the web
Yet another Carnegie Mellon approach turns the solution of Captchas into a good deed. With some 200 million Captchas being solved by humans every day, and each one requiring maybe 10 seconds, researchers asked: why not get some useful work done, too? Their answer, reCaptcha, was to build Captchas from snippets of optically scanned text from old books and newspapers that legitimate computers have been unable to recognize correctly.
But how does the computer know if someone has interpreted the old text properly? Each reCaptcha presents a challenging sample of old text next to a standard Captcha image for which the answer is already known. It’s assumed that anyone who solves the latter will also have entered a correct answer for the former. The two main sources of old texts are the Internet Archive (www.archive.org), a vast collection of freely accessible digitized media, and scanned editions of the New York Times.
Spammers have not been sitting on their hands, of course. Last fall, Russian hackers were revealed to have come up with software that regularly broke through the Captchas protecting Microsoft’s Hotmail service. Google’s Captchas also were found to be vulnerable.
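The reCaptcha pairing described above, one word with a known answer shown beside one the scanner could not read, might be sketched roughly as follows. This is a guess at the general shape, not the actual reCaptcha implementation: the function names, the in-memory tally, and the rule of accepting a transcription once two visitors agree are all assumptions made for illustration. The article says only that a correct answer to the known word is taken as evidence for the unknown one.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown-word id -> tally of human answers.
votes: dict[str, Counter] = defaultdict(Counter)


def check_response(control_answer: str, control_typed: str,
                   unknown_id: str, unknown_typed: str) -> bool:
    """Pass the visitor only if the known (control) word is typed correctly.

    When the control word is right, the visitor's reading of the unknown
    word is recorded as one vote toward its transcription.
    """
    if control_typed.strip().lower() != control_answer.strip().lower():
        return False  # failed the known word: reject and record nothing
    votes[unknown_id][unknown_typed.strip().lower()] += 1
    return True


def best_guess(unknown_id: str, min_votes: int = 2):
    """Return the leading answer once min_votes visitors agree, else None.

    The agreement threshold is an assumption, not something the article states.
    """
    if not votes[unknown_id]:
        return None
    answer, count = votes[unknown_id].most_common(1)[0]
    return answer if count >= min_votes else None


# Two visitors solve the control word and agree on the scanned word.
check_response("morning", "morning", "scan-1", "upon")
check_response("morning", "Morning", "scan-1", "upon")
print(best_guess("scan-1"))
```

A visitor who fails the known word is turned away and contributes nothing, so a bot that cannot read the control image also cannot pollute the transcriptions.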
Stories have appeared on the Web about hackers actually paying people to solve Captchas, though the economics don’t seem particularly compelling. One such story, later debunked, had certain spammers surreptitiously employing unwitting visitors to pornography sites. Supposedly, when a spammer’s computer encountered a Captcha it couldn’t break itself, it would zap the image to a porn server for presentation to a human, adding a small incentive for solving it. Trouble is, not even porn pages are viewed in volumes sufficient to get Captchas solved fast enough to meet the needs of spammers.







