Voice Assistants in the “Uncanny Valley”
By Marie Kleinert | Senior Voice User Interface Architect at VUI.agency | 08.12.2020
In conversations with colleagues and customers, one question comes up again and again: "How human can, or should, a synthetic voice sound?" Two positions usually stand opposed: "No, that sounds terrible, tinny and unnatural," versus "It's creepy when you can't even tell that it's actually a machine speaking." These discussions show perfectly that it is not enough to push the technology as fast as possible toward human intelligence and naturalness. Designers and developers of voice assistants must also watch out for the pitfalls of the "Uncanny Valley".
Thanks to Anne Lindner for sharing her work with us
Voice assistants, social robotics theory and their challenges
Voice assistants are the first social robots that have really made it into our everyday lives. Yet difficulties remain, and you often hear from frustrated users. Social robotics has a theory that explains why.
Human-machine interaction is shaped to a large extent by our tendency as humans to assign human characteristics to machines and computer systems. The car gets a name, the word processor has a bad day, the coffee machine needs to be fed. These machines and systems make no claim to human qualities, and yet even here we anthropomorphize and humanize them.
So, what about robots and virtual assistants that are designed to resemble humans?
On the one hand, a human-like appearance has the advantage of familiarity and thus facilitates communication. On the other hand, human-like appearance or behavior in a machine triggers an expectation of human abilities that cannot always be fulfilled. Striking a balance between "behaving like a human" and "being non-human" poses a particular challenge.
The Uncanny Valley
The theory of the "Uncanny Valley", proposed by the roboticist Masahiro Mori in 1970, deals precisely with this balance problem. It mainly refers to robots with a human-like appearance and human-like motor skills.
In a thought experiment, Mori put forward the thesis that affinity for a robot increases steadily up to a certain degree of human resemblance. Once this critical degree is exceeded, the strong human resemblance triggers a feeling of uneasiness, and affection and acceptance drop abruptly. Only when the design of the robot is so perfected that it looks deceptively human would a positive reaction be expected again.
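Mori's curve can be sketched as a simple piecewise function. This is a minimal illustrative model only: the breakpoints (0.7 and 0.85) and the depth of the dip are assumptions chosen to reproduce the shape he described, not values from his essay.

```python
def affinity(likeness: float) -> float:
    """Toy model of the Uncanny Valley curve.

    `likeness` is a human-likeness score in [0, 1]. Affinity rises
    with likeness, drops sharply past a critical point (the valley,
    where it even turns negative), and recovers as the design
    approaches perfect human likeness. All breakpoints are
    illustrative assumptions, not measured values.
    """
    if likeness < 0.7:
        # Affinity grows steadily with human resemblance.
        return likeness / 0.7
    elif likeness < 0.85:
        # The abrupt drop: strong-but-imperfect resemblance feels uncanny.
        return 1.0 - (likeness - 0.7) / 0.15 * 1.3
    else:
        # Near-perfect likeness climbs back out of the valley.
        return -0.3 + (likeness - 0.85) / 0.15 * 1.3
```

Plotting this function over [0, 1] reproduces the characteristic rise, dip below zero, and recovery of Mori's diagram.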
Today, not many people interact with such anthropomorphic robots, but they do interact with voice assistants. Although they do not imitate a human appearance or movement, they do imitate other characteristics and abilities such as speech, emotion, character, and intelligent action. These human characteristics can also be classified in the Uncanny Valley paradigm.
A decisive explanation for the abrupt drop in user acceptance is not tied to movement and appearance alone. Characters that mix artificial and human features inconsistently are perceived more negatively than characters that are consistently artificial or consistently human. Imagine a voice assistant with the best AI and an answer to every question, but a poorly synthesized voice: unnatural intonation, choppy delivery.
Would this voice assistant be fully accepted? Not very likely.
The same holds for the inverted situation. A voice assistant with a perfectly synthesized voice but poor speech recognition, one from which you often hear only "I'm sorry, I didn't understand that," would also attract negative attention. Probably even more so: its first sentences might lead some users to believe they are dealing with a human being.
Inconsistency, the deviation from expectations at certain points, leads to users not knowing where they stand; a feeling of insecurity or discomfort arises. The voice assistant drops into the "Uncanny Valley".
It is both exciting and difficult when several human aspects, such as appearance and voice, meet. How do the different aspects influence each other in their own expressions of anthropomorphism? Here it seems to be of particular importance that the individual aspects have the same degree of human likeness so that a coherent robot or assistant is created that users will accept.
Avoiding inconsistencies – a challenge
A good example is provided by movies that use CGI characters based on real actors. The characters move very human-like and are voiced by the respective actors, but their appearance and facial expressions can look quite artificial while still resembling the real person, who may even be known to the audience. For viewers this sometimes feels strange: is it the actor or not?
These inconsistencies can occur not only between different human characteristics and abilities but also within a single characteristic.
The human voice and speech are particularly complex in this respect. Does the tone of voice fit the situation? Does the voice match the person’s age? Does the intonation fit the core statement of the sentence?
There are many factors in the voice alone that must fit together to present listeners with a natural voice. The difficulty of building human-like machines that we accept can even be viewed with some pride: we are special and not easy to copy. Nevertheless, we want machines to support us in the most human and natural way possible. For this, we should decide what degree of human likeness we want to achieve, communicate it consistently across all aspects, and thus meet the expectations of the users.
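One way to think about this consistency requirement is to rate each design aspect on the same human-likeness scale and check how far apart the ratings are. The sketch below is purely illustrative: the aspect names, scores, and the 0.2 warning threshold are assumptions, not an established metric.

```python
def consistency_gap(aspect_scores: dict[str, float]) -> float:
    """Spread between the most and least human-like design aspect.

    `aspect_scores` maps aspects of an assistant (e.g. voice quality,
    language understanding, personality) to a human-likeness score
    in [0, 1]. A small gap suggests a coherent design; a large gap
    flags the kind of inconsistency that can drop an assistant into
    the Uncanny Valley. All names and values here are hypothetical.
    """
    scores = aspect_scores.values()
    return max(scores) - min(scores)

# Example: a near-perfect voice paired with weak understanding.
assistant = {"voice": 0.95, "understanding": 0.40, "personality": 0.70}
if consistency_gap(assistant) > 0.2:  # threshold is an illustrative choice
    print("warning: inconsistent human-likeness across aspects")
```

The point of the sketch is the design heuristic, not the numbers: lowering the best aspect or raising the weakest one both reduce the gap, which mirrors the article's argument that a coherent overall degree of human likeness matters more than maximizing any single aspect.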
Next steps in Conversational Design
In Conversational Design, we pay attention to exactly these pitfalls. Above all, our work enables us to keep the aspects of personality, speech, and intelligence consistent and to coordinate them with each other. Finally, we must of course match these aspects to the voices available to us on each platform. A balance that is not easy to achieve, but one that makes our job all the more exciting.
In the end, the goal is for users to be able to concentrate on the actual interaction and to know what they can and cannot expect. What do you think?