Is contextual voice experience the next breakthrough in voice user interaction?

Mohamed Hassan 24.03.2021 7 Minutes

We took part in the Chatbot Africa & Conversational AI Summit 2021, and this post is a follow-up to our talk, “Contextual voice experiences.”

There was a surprising amount of engagement and interest in the summit, and it was great to connect with voice experts from the African region. Chatbots and virtual assistants are having a moment right now, transforming millions of businesses all over the world and accompanying us in our everyday lives through our phones and smart speakers.

During the conference, the talks on the challenges of connectivity, language support, and localization were the headliners. It was a pleasant surprise to see that inclusivity and the AI diversity challenge were recurring keywords throughout the presentations. Diversity is also one of the main tech challenges we need to solve quickly, a fact clearly pointed out by the Stanford University 2021 AI Trends report.


Furthermore, if we want to create intelligent assistants, we need to approach the general intelligence problem by including everyone, and everyone means literally everyone. It is still a big challenge for today’s voice assistants to accommodate more languages, dialects, and even cultural concepts.

Voice assistants still can’t fully understand us, even when they speak our language or dialect.

Although voice assistants today can speak properly in some languages, they still can’t fully understand what we, including people with disabilities, say most of the time, or what kids say with their simpler vocabulary. AI has had this language understanding challenge for quite a long time. Maybe voice assistants are missing a totally different component of understanding.

Or let’s say that they are missing a component of intelligence as this Facebook research suggests:

“If AI systems can glean a deeper, more nuanced understanding of reality beyond what’s specified in the training data set, they’ll be more useful and ultimately bring AI closer to human-level intelligence.”

Since people use language to communicate, the learning curve to use voice assistants shouldn’t be that long. Talking to a voice assistant will immediately reveal its limitations. Can it understand the difference between “here” and “there”? What does that mean spatially? Is there a relationship between them?

Understanding this relationship means grasping reality beyond the trained utterances.

After all, we can’t label everything in the world. Common sense and context recognition are concepts that need to exist in an intelligent voice assistant. Common sense helps humans accomplish new tasks using mental models, without needing much training. Compare what you could do as a child with what you can do as an adult. This is where the “contextual” factor in voice experiences comes from, and it is the current research focus of my Ph.D. studies.

Let’s have a look at the already available features that help voice assistants be smarter and understand more context.

Voice assistants currently on the market can make use of the following data:

Conversation history

It helps not only with training but also with estimating the probabilities of what users might ask next.

User profile information

It helps with customizing dialogues according to age and preferences.

Location/time

It helps with improving the proactiveness of voice assistants and with increasing user satisfaction and long-term usage.

Usage data

It helps with understanding the level of engagement and drop-outs.

Input/touchpoints

It helps with the decision on which devices or services to start/continue the experience on.

Sentiment analysis

It helps with adjusting the dialogues according to mood and emotions, and even with understanding the intention. (There is a difference between saying “I need daddy” in a low, slow tone and saying it in a high, loud tone. The latter might indicate an emergency request.)
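As a minimal sketch, the context signals listed above could be grouped into one structure that a dialogue component reads from. The class name `VoiceContext`, its field names, the sentiment labels, and the toy rule in `is_possible_emergency` are all hypothetical and only illustrate the idea; they do not describe any existing voice platform.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class VoiceContext:
    """Hypothetical container for the context signals listed above."""
    conversation_history: List[str] = field(default_factory=list)
    user_profile: Dict[str, str] = field(default_factory=dict)  # e.g. age group, preferences
    location: Optional[str] = None
    timestamp: Optional[datetime] = None
    usage_sessions: int = 0              # rough engagement signal
    touchpoint: Optional[str] = None     # e.g. "smart_speaker", "phone"
    sentiment: Optional[str] = None      # assumed labels: "calm", "distressed"


def is_possible_emergency(utterance: str, ctx: VoiceContext) -> bool:
    """Toy rule: the same words may signal urgency when the tone is distressed."""
    asks_for_person = utterance.lower().startswith("i need")
    return asks_for_person and ctx.sentiment == "distressed"
```

With this sketch, “I need daddy” is flagged only when the sentiment signal is "distressed", which mirrors the low-tone versus high-tone example above. A real system would rely on a trained model rather than a string check.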

However, we need further work that builds on top of these features. We need common sense, or a context management component, in the technology behind voice assistants.

VUI.agency – contextual voice experience process

AI research today is interested in modeling common sense and enabling machines to learn key concepts by themselves. Self-supervised learning is a key aspect of unlocking that “Context Management” component in voice tech.

One important benefit of that component is that the voice assistant will relate a new request to what has been said earlier in the conversation and draw assumptions that lead to useful conclusions about the actual intention behind it. In other words, “I need Daddy” will mean “Please tell Daddy to come to me” (based on our assumption from the few seconds we saw in that video).

The relationship between people or objects in a home setting is very important for augmenting the voice assistant’s ability to make decisions, i.e., to respond.

Mohamed Hassan / Senior Voice / AI Experience Architect at VUI.agency

This relationship helps voice assistants understand the difference between “here” and “there” or “inside” and “outside”. It even helps them understand what a “living room” or a “bedroom” means: which activities take place there on a normal day, and where the furniture and devices are located.

A question like “Does it make sense to turn on all the bedroom lights at 3 a.m. while people are sleeping inside?” could easily be answered by a voice assistant that has been given all the required parameters.

With that level of understanding, we can have a voice assistant that really offers assistance, and maybe even intelligence, beyond the mainstream command-and-control type of assistant.
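The bedroom-lights question can be sketched as a simple rule over a few context parameters. Everything here is an illustrative assumption: the `RoomState` fields, the function name `should_confirm_full_lighting`, and the night window of 23:00 to 06:00 are hypothetical, and a real assistant would learn or configure these values rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class RoomState:
    """Hypothetical snapshot of what the assistant knows about a room."""
    name: str                # e.g. "bedroom"
    occupied: bool
    occupants_asleep: bool


def should_confirm_full_lighting(
    room: RoomState,
    now: time,
    night_start: time = time(23, 0),   # assumed night window, not a standard
    night_end: time = time(6, 0),
) -> bool:
    """Return True when turning on every light should be confirmed first."""
    # The night window wraps around midnight, hence the "or".
    is_night = now >= night_start or now < night_end
    return is_night and room.occupied and room.occupants_asleep
```

For example, a request at 3 a.m. for an occupied bedroom with sleeping occupants would trigger a confirmation, while the same request at 3 p.m. would not.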

Based on the nature of the conversation, today’s voice assistant archetypes can be grouped into:

Command-and-control

An assistant that waits to be addressed and commanded, and directly fulfils the user’s intentions. A conversation with this assistant can result in a dialogue. (If you ask for a recipe, it will show you a recipe.)

Assistive

An assistant that can offer recommendations when asked, fulfilling the user’s intentions both directly and indirectly. A conversation with this assistant can result in a dialogue. (If you ask for a recipe, it will show you a recipe and advise you to buy missing ingredients if necessary.)

Collaborative

An assistant that can work with users to formulate intentions and then fulfil them in a cooperative style. A conversation with this assistant can result in a dialogue or debate. (If you ask for a recipe, it will show you a recipe and advise you to buy missing ingredients if necessary, then start cooking with you, e.g., by turning on the oven.)

Autonomous

An assistant that can govern itself and act on its own to fulfil the user’s preprogrammed intentions. A conversation with this assistant can result in a dialogue, debate, discourse, or diatribe. (If you ask for a recipe, it will just start cooking.)
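The recipe example running through the four archetypes can be summarized in a short sketch. The `Archetype` enum, the `recipe_actions` function, and the action names are hypothetical labels chosen for illustration; they only restate the behaviors described above.

```python
from enum import Enum
from typing import List


class Archetype(Enum):
    COMMAND_AND_CONTROL = 1
    ASSISTIVE = 2
    COLLABORATIVE = 3
    AUTONOMOUS = 4


def recipe_actions(archetype: Archetype, missing_ingredients: bool) -> List[str]:
    """Return the (hypothetical) actions each archetype takes for a recipe request."""
    if archetype is Archetype.AUTONOMOUS:
        # Acts on its own: no recipe shown, it just starts cooking.
        return ["start_cooking"]

    actions = ["show_recipe"]
    advises = archetype in (Archetype.ASSISTIVE, Archetype.COLLABORATIVE)
    if advises and missing_ingredients:
        actions.append("suggest_buying_missing_ingredients")
    if archetype is Archetype.COLLABORATIVE:
        # Cooks together with the user, e.g. by turning on the oven.
        actions.append("turn_on_oven")
    return actions
```

Each step up the list needs more context to be useful: the command-and-control case needs only the request, while the collaborative and autonomous cases need to know what is in the kitchen and what the user is doing.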

If we look closely, we will find that more and more context is needed as we go from the 1st to the 4th archetype. There is also a correlation with agency. The further we move up the archetypes, the more agency over our actions we give up, and the less detailed control we have over the voice assistant. The voice assistant becomes more than an obedient agent for us; it might become “us”. (We will talk more about that in another blog post.)

Image: thanks to Ekaterina Bolovtsova for sharing her work on Pexels.

Voice is a very personal interface. It is the closest interface between humans and machines because it uses “our” language and hence our way of thinking.

Mohamed Hassan / Senior Voice / AI Experience Architect at VUI.agency

So, first, the design process for a voice experience needs to include contextual inquiry. We need to start integrating environmental and physical models into our contextual inquiry (Holtzblatt et al., 1998). That will help us understand the macro and micro levels of voice interactions and define the needed context parameters.

Secondly, context is a technical requirement: current voice technologies and solutions need to include a context management and recognition system that developers can use. Developments in self-learning algorithms are necessary to enable that.

How we design for contextual voice experiences at VUI.agency

At VUI.agency, we take contextual information into account when we design voice experiences. From the inquiry phase through testing, the user’s context and situation are at the center of our process. We try to push the limits of technology to achieve a smarter and more assistive voice assistant.

If we want smart assistants that help us in our everyday lives and eliminate our frustrations, we need to invest more in new, innovative frameworks and technologies.

Finally, businesses and brands should focus more on exploring new ground for voice and on moving beyond the current technical limitations.

Contact us if you want to know more about context in voice.


Mohamed Hassan

VP Design & CX
