The experience of using voice commands to control computers has been transformed by a new generation of voice interaction systems. Apple’s Siri and Google Now, which have been available for a few years, prove that this technology is no longer in its infancy. It’s now more like a toddler—just beginning to walk and talk but still regularly falling down, and often speaking complete nonsense.
Siri and Google Now are embedded into smartphones, but the Echo, a newer voice interaction system from Amazon, takes a different approach, offering voice interaction with a stationary device rather than as part of a mobile operating system.
On paper, Siri, Google Now, and Alexa (the name used when speaking to the Echo) seem pretty similar. They all continuously listen for an "activation phrase," then execute user instructions for tasks such as playing music, looking up information, and setting timers.
Oddly, even though the Echo was released several years after Siri and Google Now, Alexa isn’t very smart compared to those systems. Consider the results when you ask all three systems the same question:
What temperature should chicken be cooked to?
Google Now: "One-hundred sixty-five degrees Fahrenheit; according to Kitchen Fact, the safe internal temperature for cooked chicken is one-hundred sixty-five degrees Fahrenheit."
Siri: "Let me check on that . . . Okay, I found something on the web for what temperature should chicken be cooked to . . . take a look . . . ."
Alexa: "Sorry, I didn’t understand the question I heard."
Clearly, Alexa is not the brightest bulb in the box. She’s also missing something critical that both Siri and Google Now offer: a screen to show visual output. The Echo does have a companion smartphone app, but it’s focused on adjusting settings and other auxiliary functions. The core interaction takes place with the device itself, which is essentially a computer that must be kept plugged into a wall and that includes only a speaker and microphone. Without a screen, the Echo can’t display rich output such as a list of search results.
On the face of it, the Echo sounds like a big step back, especially considering that most of the technology enthusiasts who tend to be early adopters of new products probably already own a smartphone equipped with Siri or Google Now.
So why would people want to pay for a device that has fewer capabilities than one they already own?
One of the classic usability heuristics is error prevention: the notion that rather than just help users recover from errors, systems should prevent errors from occurring in the first place. As speech recognition has improved in recent years, errors in natural-language understanding have dropped significantly. Hopefully this trend will continue for all voice interaction systems.
But there’s one clear type of error that is quite common with smartphone-based voice interaction: the complete failure to detect the activation phrase. This problem is especially common when there are multiple sound streams in the environment, for example when the device is playing music and you give a command to stop the music, as shown in this video:
As illustrated in the recording above, Siri often fails to detect voice commands when there is interfering noise, such as music. Siri may also ignore commands when the device is relatively far away or when you have the device in a pocket or in a purse. If you’ve set a timer and want to know how much time remains, a typical experience with Siri might go like this:
"Hey Siri, how much time is left?" (with the phone in your pocket)
"Hey Siri, how much time is left?" (after taking the phone out of a pocket)
Siri: "I found an article about the Times. Shall I read it to you?"
"No, how much time is left on the timer?"
Siri: "Here’s the timer. It’s running at eight minutes and eight seconds."
The Echo, on the other hand, prioritizes voice interaction above all else. It includes seven microphones and places primary emphasis on distinguishing voice commands from background noise (as opposed to the iPhone 6s, which is a compact mobile device with a screen and incorporates only two microphones). The results are dramatic: there is no need to take anything out of your pocket, and even from across the room, Alexa reliably responds:
"Alexa, how much time is left?"
Alexa: "About 6 minutes and 10 seconds."
Besides the superior voice recognition, there’s also a difference in the semantic processing of the two examples above. Alexa interpreted "time" as referring to the device timer, while Siri assumed it was a general web query, and didn’t relate it to the device timer until the specific keyword "timer" was added. Siri’s ability to expedite web searches with voice input for queries is certainly valuable, but the bias toward interpreting user questions as web searches can actually increase error rates when doing other tasks. The benefit of the Echo’s more focused functionality is even more apparent if you need multiple timers (not an uncommon scenario when cooking). When asked to set a new timer, Alexa easily responds, "Second timer set for 40 minutes, starting now," while Siri, which only has one timer, balks: "Your timer’s already running, at 9 minutes and 42 seconds. Would you like to change it?"
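The difference in interpretation described above can be sketched in a few lines of Python. This is only an illustration of the idea of checking device state before falling back to a web search; it is not how Alexa or Siri actually work, and the function name `interpret`, the `active_timers` dictionary, and the keyword rule are all assumptions made for the example.

```python
def interpret(utterance: str, active_timers: dict[str, int]) -> str:
    """Answer a spoken question, preferring device state over a web search.

    active_timers maps a hypothetical timer name to its remaining seconds.
    """
    words = utterance.lower().replace("?", "").split()
    # Assumption: if any timer is running, a question that mentions "time"
    # or "timer" is about those timers rather than a general web query.
    if active_timers and ("time" in words or "timer" in words):
        parts = [
            f"{name}: {secs // 60} min {secs % 60} s left"
            for name, secs in active_timers.items()
        ]
        return "; ".join(parts)
    # Otherwise fall back to treating the utterance as a web search.
    return f"web search: {utterance}"


print(interpret("How much time is left?", {"first timer": 370}))
print(interpret("How much time is left?", {}))
```

Because the timers are kept in a dictionary, the same sketch also covers the multiple-timer case: each running timer is reported in turn. With no timer running, the question falls through to a web search, which is the bias the Siri example shows even when a timer is active.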
Siri’s less powerful voice detection isn’t always a deal-breaker—it depends on the task. When searching for information, usually you would need to be close to a screen to see detailed results anyway, so getting the phone out of the pocket will not increase the task time beyond what it would be without voice control. And speaking the command is likely to be faster than typing it anyhow.
But for short tasks, failing to hear a command the first time can easily tip the balance and make the voice system more cumbersome and time-consuming than an existing physical alternative, such as glancing at a digital timer or walking across the room to flip a light switch. New technologies must make tasks faster and easier in order to be viable replacements for existing tools. For short tasks, voice detection errors can make this impossible.
Alexa’s superior accuracy in detecting voice commands from a distance and despite background noise affects another core usability principle: the degree to which the system provides flexibility and efficiency.
Smartphone voice interaction systems can accompany you everywhere, while the Echo, with its larger size and need for continuous power, is constrained to operate only within the home environment. But paradoxically, within that environment, this comparatively clunky device enables far more flexibility for users, who don’t need to carry the thing around constantly in order to have instant access to its capabilities. Consequently, you have a wider range of options for how and when you can use the device: you can talk to it while you’re in the middle of another task such as cooking, or from your bed or couch, whether you have restricted movement due to a disability or simply don’t feel like getting up.
This type of pure voice control requires continuous listening for the activation phrase. Although having your every word monitored by a device may seem creepy, the immediate result is that the voice recognition device can actually be less intrusive—because you don’t have to remember to carry it around, or take it out and look at it instead of at your companions.
On the other hand, one of the Echo’s biggest limitations is that some tasks are still painfully inefficient. Adding one item to the shopping list is easy and quick, but adding several items quickly becomes repetitive and time-consuming. For each item you must state the activation word, then the command, and then wait for a verbal confirmation before repeating the process for the next item. The result is a conversation like this:
User: "Alexa, add milk to the shopping list."
Alexa: "I’ve added milk to your shopping list."
User: "Alexa, add eggs to the shopping list."
Alexa: "Eggs added to your shopping list."
User: "Alexa, put butter on the shopping list."
Alexa: "I’ve added butter to your shopping list."
User: "Alexa, add cereal to the shopping list."
Alexa: "I’ve added cereal to your shopping list."
User: "Alexa, put cheddar cheese on the shopping list."
Alexa: "Cheddar cheese added to your shopping list."
User: "Alexa, put sugar, flour, and salt on the shopping list."
Alexa: "I’ve added sugar flour salt to your shopping list."
After about the third item, you start thinking that surely there must be a faster way to do this. You can shortcut the process by saying several items together, but the list ends up showing the entire text string as a single list item.
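To make the single-item problem concrete, here is a minimal Python sketch of how a spoken list could be split into separate entries. It assumes the speech transcript preserves commas or the word "and" between items, which may not hold in practice (Alexa’s confirmation above reads back "sugar flour salt" with no separators). The function name `split_list_items` is hypothetical and does not correspond to any real Echo feature.

```python
import re


def split_list_items(utterance: str) -> list[str]:
    """Split a spoken list such as 'sugar, flour, and salt' into items."""
    # Split on commas and on the standalone word "and"; the word
    # boundaries keep words like "candy" or "sand" intact.
    parts = re.split(r",|\band\b", utterance)
    # Drop the empty fragments left over between ", " and "and".
    return [part.strip() for part in parts if part.strip()]


print(split_list_items("sugar, flour, and salt"))
print(split_list_items("cheddar cheese"))
print(split_list_items("sugar flour salt"))
```

The last call shows why the problem is harder than it looks: without separators, "sugar flour salt" stays one item, and a system cannot tell it apart from a genuine multiword item such as "cheddar cheese" by word boundaries alone. The sketch would also wrongly split an item that itself contains "and".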
In other ways, Alexa is worse than Siri or Google Now at upholding usability heuristics, primarily due to the lack of a screen. Visibility of system status is limited to an animated light ring; while this is well executed, it’s a far cry from the rich textual feedback possible with a screen. Supporting recognition over recall is also severely constrained in a voice-only interface, since even reciting a list of options requires users to store the options in working memory while they make a selection.
When new technologies become available, enthusiasts are often quick to declare that we need to start from scratch and reinvent design methods and principles that are more appropriate to the new technology. Eliminating a visual display certainly transforms the interaction experience. But does the shift from visual to auditory output mean that all the rules have changed?
No matter how different the technology, the people who are using it haven’t changed. And most usability principles have more to do with human capabilities and limitations than with technology. (Examples of such eternal design principles discussed in this article include error prevention, flexibility, efficiency, visibility of system status, and recognition versus recall.) The Echo offers unique value, even to users who already own good voice interaction systems. Although the medium of voice is quite different, both the frustrating errors and the seemingly magical success moments experienced when using the Echo can be clearly traced to tried-and-true usability heuristics.