Bashing Siri, the iPhone 4S virtual assistant, seems to be fashionable these days. Mat Honan declares it “Apple’s broken promise”. CNN reports on Siri’s alleged anti-abortion bias (via Danny Sullivan). Colbert weighs in. John Gruber remarks how weird it is for Apple’s flagship new product to be “so rough around the edges”, yet notes that it will be easier to improve voice recognition while it’s being widely used.
It’s not just easier, it’s the only way!
I worked on speech recognition with IBM Research for nearly six years. We participated in DARPA-sponsored research projects, field trials, and actual product development for various applications: dictation, call centers, automotive, even a classroom assistant for the hearing-impaired. The basic story was always the same: get us more data! (Data, in this case, means transcribed speech recordings.) There is even a saying in the speech community: “there is no data like more data”. Some researchers have argued that most of the recent improvements in speech recognition accuracy can be credited to having more and better data, not to better algorithms.
Transcribed speech recordings are used to train acoustic models (how sound waveforms relate to phonemes), pronunciation lexicons (how people actually mispronounce words, especially the names of people and places), language models (spoken phrases rarely conform to English grammar), and natural language processors. And all of that has to be done for each supported language! More training data means the recognizer can handle more variations in voices, accents, manners of speech, etc. That’s undoubtedly why Nuance, for example, offers a free dictation app.
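As a toy illustration of why more transcripts help the language model, here is a minimal Python sketch. It counts word pairs (bigrams) in a set of transcribed requests and measures what fraction of a new request's word pairs have been seen before. The function names and sample sentences are made up for this example, and real recognizers use far larger and more sophisticated models; this only shows the direction of the effect.

```python
from collections import Counter


def bigrams(sentence):
    """Split a transcript into word pairs, with start/end markers."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(words, words[1:]))


def train_bigram_counts(transcripts):
    """Count every word pair seen in the training transcripts."""
    counts = Counter()
    for transcript in transcripts:
        counts.update(bigrams(transcript))
    return counts


def coverage(counts, sentence):
    """Fraction of the sentence's word pairs that appeared in training."""
    pairs = bigrams(sentence)
    seen = sum(1 for pair in pairs if counts[pair] > 0)
    return seen / len(pairs)


# Hypothetical transcripts, invented for illustration only.
small_corpus = ["find a coffee shop", "call my mom"]
large_corpus = small_corpus + [
    "find a clinic near me",
    "where is a coffee shop near me",
]

request = "find a clinic near me"
small_counts = train_bigram_counts(small_corpus)
large_counts = train_bigram_counts(large_corpus)

print(coverage(small_counts, request))  # only part of the request is familiar
print(coverage(large_counts, request))  # every word pair has been seen
```

With the small corpus, most of the request is made of word pairs the model has never encountered; with the larger one, all of them are covered. The same logic applies, at a much larger scale, to voices, accents and pronunciations.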
It is tempting to think of Siri as some kind of artificial intelligence that, once trained properly, can answer all sorts of questions. The reality is that it is a very complex patchwork of subsystems, many of which are handcrafted.
To improve Siri, engineers must painstakingly look at the requests that she could not understand (in all languages!) and come up with new rules to cope with them. There are probably many, many gaps like “abortion clinic” in the current implementation, which will be fixed over time. When Apple states “we find places where we can do better, and we will in the coming weeks”, they are plainly describing how this process works.
It is important to understand that unlike Apple’s hardware and app designs, Siri’s software could not have been fine-tuned and thoroughly tested in the lab prior to a glorious release. It had to be released in its current form, to get exposure to as much variability as possible all the way from the acoustics to the interpretation of natural language. For each of the funny questions that Apple’s engineers had anticipated, poor Siri has to endure a hundred others.
If the rumors of a speech-enabled Apple TV are true, then Siri will soon have other challenges. For example, far-field speech recognition is notoriously more difficult than recognition with close-talking microphones. She had better get a head start with the iPhone 4S.
[UPDATE: There has been a lot of interest in the article, so I thought I would clarify a few things.]
-I have no inside information. Everything I wrote about Siri is an educated guess based on my own experience. I may be totally wrong, and I probably missed some important parts of the story.
-I did not mean to imply that Siri’s system is rule-based. I am convinced that it relies heavily on statistical learning. But someone has to train, fine-tune, test and debug statistical algos with new data and new use cases. Sometimes you just throw in the new data and press the “retrain” button. Sometimes you have to dive in and adapt algorithms. And sometimes, in order to squeeze out the last few percentage points, you may write some old-fashioned rules, as with Siri’s quirky replies.
-As a few commenters pointed out, Apple has already gathered a lot of data from the previous Siri app. I think they used it to build the best system they could, which is already quite impressive IMO. They had to release it to be able to go even further. New data brings diminishing returns: at some point, 20% or 50% more data makes no significant difference; you want 10x or 100x more.
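To put rough numbers on the diminishing-returns point, here is a small Python sketch. It assumes that recognition error falls as a power law of the amount of training data, error(N) = c * N ** -alpha, with alpha = 0.3. Both the power-law shape and the exponent are assumptions chosen for illustration, not measurements of Siri or of any real recognizer.

```python
def relative_error_reduction(data_multiplier, alpha=0.3):
    """Fraction by which error drops when training data grows by data_multiplier.

    Assumes error(N) = c * N ** -alpha. The exponent alpha is a made-up
    illustrative value, not a measured one.
    """
    if data_multiplier <= 0:
        raise ValueError("data_multiplier must be positive")
    return 1.0 - data_multiplier ** -alpha


# Compare modest increases (20%, 50%) with order-of-magnitude increases.
for multiplier in (1.2, 1.5, 10, 100):
    reduction = relative_error_reduction(multiplier)
    print(f"{multiplier:>5}x data -> {reduction:.0%} lower error")
```

Under these assumptions, 20% or 50% more data trims the error by only about 5% and 11% respectively, while 10x and 100x more data cut it by roughly 50% and 75%. A different exponent would change the figures but not the overall picture.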