Almost 25 years ago, researcher Xuedong Huang founded the speech recognition program at Microsoft Research (MSR). His groundbreaking work ended up in Microsoft products like Cortana and Kinect. Today, voice recognition is pretty much figured out. But while computers hear us well, they still don’t see us very well. Gestural interfaces are still rudimentary. We may have virtual reality at home, and yet those systems can’t even make out our own hands.
That may change soon, as Huang says a "paradigm shift" is happening within Microsoft Research. In a newly released demo of its gesture platform, Handpose, the company is revealing an unprecedentedly accurate hand tracking system that requires so little processing power that it could scale from computers to tablets to VR headsets.
Huang, who now consults on gesture research at MSR labs spanning the globe, says gesture recognition has been stuck where voice recognition was in the 1970s.
"A very simple way to understand it is, in the '70s, for every word, we had a whole template," he explains. So "banana" had, essentially, a stored image in the computer that your utterance was matched against. To improve, voice recognition systems added more and more of these templates for "banana" to cover the different ways people might pronounce the word.
In the '80s, a profound shift happened, he continues. Voice recognition systems began analyzing phonemes—the unique sound chunks that together make up words—rather than entire words, so a whole logic system could be built that mixed and matched different sounds to postulate what you might be saying. Add a few decades of data, and mountains of information collected by services like Google, and voice recognition is pretty good.
Most gesture systems, including Microsoft Kinect, still use this simple style of template matching. But Handpose, MSR's new gesture recognition system, abandons those templates completely. Instead, it incorporates what MSR calls a "gesture vocabulary." The system looks at your hand and, instead of seeing it as a whole blob that needs to match something preprogrammed in the system, breaks it up into independent pieces, so it can reason about how chunks of your fingers and knuckles curl into a fist. "Those core elements are almost like a phoneme for a pronunciation of a word," says Huang.
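To make the contrast concrete, here is a toy sketch in Python. It is not Handpose's actual method, which the article does not detail. It assumes a hypothetical input of one "bend" number per finger, and all names in it (`TEMPLATES`, `match_template`, `recognize`) are invented for illustration. The first function matches the whole hand against stored examples; the second labels each finger on its own and then composes those labels into a gesture.

```python
from math import dist

# Hypothetical input: one bend value per finger, thumb to pinky,
# where 0.0 means fully extended and 1.0 means fully curled.
FINGERS = ("thumb", "index", "middle", "ring", "pinky")

# Template style: one stored whole-hand example per known gesture.
TEMPLATES = {
    "fist": (1.0, 1.0, 1.0, 1.0, 1.0),
    "open": (0.0, 0.0, 0.0, 0.0, 0.0),
}

def match_template(bends):
    """Return the stored gesture whose whole-hand template is nearest."""
    return min(TEMPLATES, key=lambda name: dist(TEMPLATES[name], bends))

# Vocabulary style: describe each finger independently, then compose.
def describe(bends):
    """Label every finger as 'curled' or 'extended' (assumed 0.5 cutoff)."""
    return {
        finger: "curled" if bend >= 0.5 else "extended"
        for finger, bend in zip(FINGERS, bends)
    }

def recognize(bends):
    """Compose per-finger labels into a gesture name."""
    curled = {f for f, state in describe(bends).items() if state == "curled"}
    if curled == set(FINGERS):
        return "fist"
    if not curled:
        return "open"
    if curled == set(FINGERS) - {"index"}:
        return "point"
    return "unknown"

pointing = (1.0, 0.0, 1.0, 1.0, 1.0)  # only the index finger extended
print(match_template(pointing))  # forced onto the nearest stored template
print(recognize(pointing))       # built up from the per-finger labels
```

In this toy, the template matcher can only answer with a gesture it already has a whole-hand example for, so the pointing hand comes back as "fist." The part-based version still produces a usable per-finger description of a pose it has no whole-hand example for, which is the advantage Huang attributes to a vocabulary of core elements.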
Suddenly, a vision system like Kinect, which can currently only recognize large sweeps of your hand using a broad image-matching technique, could use these finger phonemes to identify fine motor movements like grasping tiny objects or touch-typing on a holographic QWERTY keyboard floating in midair. It might seem like MSR is playing with semantics, but Huang views this gesture vocabulary as the "physics to express ourselves."
While Huang won’t share which products will see these gesture updates, the potential implications are obvious across Microsoft’s portfolio, or at least for any device that can fit a depth-sensing camera inside. The Xbox’s Kinect could finally live up to its potential to recognize tiny motions. The Microsoft Surface could work largely without a keyboard. The HoloLens could provide virtual UIs to challenge keyboards and mice.
And Microsoft would finally have the opportunity to build a full-out Echo-killer to take over the home. Whereas Amazon uses a microphone array to connect you to Alexa, and Apple and Google mostly rely on their phones to connect to spoken AI, Microsoft could use voice, combined with gestures, to create a more discreet, empathetic, ambient computing environment for us all.
"Right now, everyone is holding a phone. You have to touch the phone to talk to it, or you have to touch the keyboard. Everyone is being tethered to the device. Mobile freed people from being tethered to a PC . . . but they’re being tethered to their phone!" says Huang. "Just imagine one possibility. You’ll be surrounded by many intelligent devices . . . and you could engage with them like real people."