That Computer Actually Got an F on the Turing Test

Alan M. Turing and colleagues working on the Ferranti Mark I Computer in 1951. Photo: SSPL/Getty Images

Over the weekend, a group of programmers claimed they built a program that passed the famous Turing Test, in which a computer tries to trick judges into believing that it is a human. According to news reports, this is a historic accomplishment. But is it really? And what does it mean for artificial intelligence?

The Turing Test has long been held up as a landmark in machine intelligence. Its creator, British computer scientist Alan Turing, thought it would mark the point at which computers had minds nearly as capable as our own. But the test's value in modern-day computer science is questionable. And the actual accomplishments of the winning chatbot are not all that impressive.

The Turing Test 2014 competition was organized to mark the 60th anniversary of Turing's death and included several celebrity judges, among them actor Robert Llewellyn of the British sci-fi sitcom Red Dwarf. The winner was a program named Eugene Goostman, which managed to convince 10 out of 30 judges that it was a real boy. Goostman is the work of a computer engineering team led by Russian Vladimir Veselov and Ukrainian Eugene Demchenko.

The program had a few built-in advantages, such as the fact that it was presented as a 13-year-old non-native English speaker from Ukraine, a persona that excuses both broken grammar and gaps in knowledge. And it tricked the judges only about 30 percent of the time (an F minus, or so). For many artificial intelligence experts, this is less than exciting.

"There's nothing in this example to be impressed by," wrote computational cognitive scientist Joshua Tenenbaum of MIT in an email. He added that "it's not clear that to meet that criterion you have to produce something better than a good chatbot, and have a little luck or other incidental factors on your side."

Screenshots in the BBC's article about the win show a transcript that reads like little more than the output of a random sentence generator. When WIRED chatted with Goostman through his programmers' Princeton website, the results felt something like an AIM chatbot circa 1999.

WIRED: Where are you from?
Goostman: A big Ukrainian city called Odessa on the shores of the Black Sea

WIRED: Oh, I'm from the Ukraine. Have you ever been there?
Goostman: ukraine? I've never there. But I do suspect that these crappy robots from the Great Robots Cabal will try to defeat this nice place too.

The version on the website could, of course, differ from the one used during the competition.

This particular chatbot almost passed a version of the Turing Test two years ago, fooling judges approximately 29 percent of the time.

Fooling around 30 percent of the judges also doesn't seem like a particularly high bar. While the group claims that no previous computer program has reached this level, numerous chatbots, some dating back to the 1960s, have been able to fool people for at least a short while. In a 1991 competition, a bot called PC Therapist got five out of 10 judges to believe it was human. More recently, there have been fears that online chatbots could trick people into falling in love with them, stealing their personal information in the process. And in a 2011 demonstration, a program named Cleverbot managed a Turing Test pass rate of nearly 60 percent.

So where does this 30 percent criterion come from? It appears to stem from a particular interpretation of Alan Turing's 1950 paper, in which he described his eponymous test.

"I believe that in about fifty years' time it will be possible, to programme computers... to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning," wrote Turing (.pdf).

So the father of the Turing Test wasn't proposing 30 percent as some threshold for intelligence; he was simply predicting where he thought computers would be five decades into the future. An interrogator with no better than a 70 percent chance of a correct identification is, by definition, fooled the remaining 30 percent of the time, and that is the bar Goostman's team claims to have cleared.
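For the numerically inclined, the arithmetic behind the claim fits in a few lines. What follows is a purely illustrative sketch, not anything the organizers published; it uses the judge counts reported for the 2014 event and the 30 percent reading of Turing's prediction described above.

    # Turing's prediction: an average interrogator has no better than a 70%
    # chance of a correct identification after five minutes of questioning,
    # which leaves the machine fooling them at least 30% of the time.
    correct_id_chance = 0.70
    deception_threshold = 1.0 - correct_id_chance  # 0.30

    # Reported 2014 result: Goostman convinced 10 of 30 judges.
    judges_fooled, judges_total = 10, 30
    deception_rate = judges_fooled / judges_total  # roughly 0.333

    verdict = "clears" if deception_rate > deception_threshold else "misses"
    print(f"{deception_rate:.1%} vs. a {deception_threshold:.0%} bar: {verdict} it")

On that generous reading, 33 percent squeaks past the bar. On any ordinary grading scale, it is still an F.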

For most modern-day artificial intelligence experts, the Turing Test has long since been superseded by other benchmarks. It's not entirely surprising that a test conceived more than six decades ago doesn't hold up, given how little was known about intelligence, both human and artificial, at the dawn of the computer age. Today we have programs that show quite interesting intelligent-seeming behavior, such as Netflix's recommendation algorithm, Google's self-driving car, and Apple's Siri personal assistant. But these are all tailored to specific tasks. What Alan Turing envisioned was a machine that was generally intelligent, one that could just as easily organize your schedule as learn Latin.

This has led cognitive scientist Gary Marcus of NYU to suggest an updated, 21st-century version of the Turing Test. Writing at the New Yorker's Elements blog, he said that a truly intelligent computer could "watch any arbitrary TV program or YouTube video and answer questions about its content—'Why did Russia invade Crimea?' or 'Why did Walter White consider taking a hit out on Jessie?'" Marcus continues:

Chatterbots like Goostman can hold a short conversation about TV, but only by bluffing. (When asked what “Cheers” was about, it responded, “How should I know, I haven’t watched the show.”) But no existing program—not Watson, not Goostman, not Siri—can currently come close to doing what any bright, real teenager can do: watch an episode of “The Simpsons,” and tell us when to laugh.

Of course, who knows what they'll say about that test in 50 years' time.