For Tarka to work effectively and efficiently, it must be able to identify the different tones people speak in when they are feeling a certain way, and then react in an appropriate manner that gives the user a sense of acknowledgement, as if Tarka had replied to what they had just said.
Who’s Doing What Today?
- Professor Eleni Stroulia and PhD student Mashrura Tasnim in the Department of Computing Science at the University of Alberta have used standard benchmark data sets and several machine learning algorithms to recognise depression from acoustic cues. An app built on this work would collect voice samples as the user speaks and then use those samples to recognise and track indicators of mood. This is the same method I used to develop the prototype AI software. (See AI Prototype Development)
- Insurer MetLife is using a program called Cogito in its call centres that can distinguish when call centre agents are feeling tired or low on energy, as well as how customers are reacting, by using voice-analysis algorithms. Cogito monitors speech characteristics such as long pauses, interruptions, conversation flow, vocal strain and rapid chatter. It also monitors voice signals, for instance callers who sound annoyed, disinterested or confused. It then uses this information to help close the ‘communication gap between customers and agents.’ Cogito is also used by Zurich Financial, where the primary language is German, which suggests that these non-verbal cues are more reliable than analysing the words that people say.
- Moxie is a social robot created by Paolo Pirjanian and Maja Mataric that supports social, cognitive and emotional development in children. It gives children tasks to complete based on a gamified narrative: Moxie has been sent from a secret laboratory to learn how to be a better friend. Moxie’s head contains microphones and cameras that feed data to machine learning algorithms, so that the robot can hold a natural conversation, recognise users and look them in the eye. Most of the data is processed on an onboard CPU, with the exception of Google’s automated speech recognition software. As Moxie interacts more with the child, it gathers more data (most likely voice and image samples) so that it can have more sophisticated interactions and recognise the child’s developmental needs. Moxie sends the data it gathers to an app that parents can monitor, and provides recommendations; for example, if it picks up on a recurring verbal tic it would suggest the parents take their child to see a speech pathologist. My project would use a similar method of gathering voice samples in order to improve its responses.
- Oto, a spin-off of SRI International, is working on new speech recognition software that uses voice intonation technology, initially to enable call centres to better understand the vocal emotions of callers and sales agents. Their technology, called DeepTone, is based on deep neural networks trained on hundreds of thousands of real conversations so that it can pick up on tiny variations of emotion in speech. These tiny variations are described as ‘latent speaker states,’ and allow emotional tone to be detected in real time, many times per second. Intonation is also the way that my project would recognise different emotions.
Critical Functionality:
The AI program would work by analysing voice samples and picking up on differences in pitch and pitch contour (the change in frequency across an utterance) in order to distinguish tone. This is known as intonation.
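As a rough illustration only, the sketch below shows how a pitch contour could be pulled out of a recorded voice sample using the librosa library’s pYIN pitch tracker. The file name, sample rate and frequency range are placeholder assumptions rather than settings from the actual prototype.

```python
# Sketch: extracting a pitch contour from a voice sample (assumed file name).
import librosa
import numpy as np

# Load a short voice sample (placeholder path); librosa resamples it for us.
audio, sample_rate = librosa.load("voice_sample.wav", sr=16000)

# Estimate the fundamental frequency (pitch) frame by frame with pYIN.
# The frequency range roughly covers typical adult speaking voices.
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz
    fmax=librosa.note_to_hz("C6"),   # ~1047 Hz
    sr=sample_rate,
)

# The pitch contour is the sequence of f0 values over time; simple summary
# statistics of that contour are the kind of features a tone classifier
# could learn from.
voiced_f0 = f0[voiced_flag]
print("Mean pitch (Hz):", np.nanmean(voiced_f0))
print("Pitch range (Hz):", np.nanmax(voiced_f0) - np.nanmin(voiced_f0))
```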
It would pick up these samples using a microphone located in the otter’s head and then, using a Wi-Fi module, run them through the algorithm on the cloud, as well as saving and storing them there. Based on what the algorithm returned, the otter would then make a noise matching the tone of the user’s voice. A flowchart of this process is located below.
Utilising cloud servers to save and store the data would be beneficial because the algorithm would be constantly updating, and as a result would be using more voice samples to determine a response, making it more accurate. (See Cloud Based Servers for the AI)
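A minimal sketch of how each incoming voice sample could be archived in the cloud for later retraining is shown below. It assumes an Amazon S3 bucket is used for storage; the bucket name and folder layout are placeholders, not decisions that have been made yet.

```python
# Sketch: archiving each recorded voice sample to cloud storage (assumed S3 bucket).
import datetime
import boto3

s3 = boto3.client("s3")

def store_voice_sample(wav_path: str, detected_tone: str) -> str:
    """Upload a recorded sample, filed under the tone the algorithm detected."""
    timestamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    key = f"voice-samples/{detected_tone}/{timestamp}.wav"   # placeholder layout
    s3.upload_file(wav_path, "tarka-voice-samples", key)     # placeholder bucket
    return key

# Example: store_voice_sample("latest_recording.wav", "annoyed")
```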
What Needs to Happen to Take this Into Reality:
Currently I have working machine learning software, built with Teachable Machine, that can use recorded voice samples to determine what tone the person is speaking in. Below is a video from the ‘AI Prototype Development’ section (please see that section for more information). There are also examples of this technology working above, in multiple different scenarios.
The Gap: How to Provide Responses Based on the Prototype AI:
The next step from this point would be to export the code that determines these outputs and add if statements, as well as a random number generator constrained to a specific range of values, to determine which otter noise would be played based on the tone of what the user has said. For example, if the algorithm detected that the user sounded annoyed, it would use the random number generator to pick an otter noise from a predetermined set of noises as a response to what the user had said, such as the sound below:
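A minimal sketch of this if-statement and random-selection logic is shown here, assuming the classifier outputs a simple tone label; the tone labels and sound file names are placeholder assumptions.

```python
# Sketch: picking a random otter noise for a detected tone.
# The tone labels and sound file names are placeholder assumptions.
import random

OTTER_SOUNDS = {
    "annoyed": ["calm_chirp_1.wav", "calm_chirp_2.wav", "soft_squeak.wav"],
    "happy":   ["excited_squeak_1.wav", "excited_squeak_2.wav"],
    "sad":     ["gentle_coo_1.wav", "gentle_coo_2.wav"],
    "neutral": ["short_chirp.wav"],
}

def choose_otter_noise(detected_tone: str) -> str:
    """Return a randomly chosen sound file for the tone the algorithm detected."""
    if detected_tone not in OTTER_SOUNDS:
        detected_tone = "neutral"
    sounds = OTTER_SOUNDS[detected_tone]
    # The random number generator picks an index into the predetermined
    # set of noises, as described above.
    index = random.randrange(len(sounds))
    return sounds[index]

# Example: if the user sounded annoyed
print(choose_otter_noise("annoyed"))
```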
The Gap: Making a Cloud Based Algorithm for the AI
After this has been done, the code would have to be converted and uploaded to Amazon SageMaker, where it could be built and deployed as a fully functional cloud-based algorithm. Using a cloud vendor for the AI would also bring additional benefits compared with building and maintaining a server myself (see Cloud Servers for the AI for more detail).
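As an illustration only, and assuming the trained tone model had already been deployed as a SageMaker endpoint, calling that endpoint from other code could look roughly like the sketch below. The endpoint name, request format and response shape are all assumptions, not the final design.

```python
# Sketch: calling a deployed SageMaker endpoint that returns a tone label.
# Endpoint name, request format and response format are all assumptions.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def classify_tone(feature_vector: list) -> str:
    """Send extracted audio features to the cloud model and return its tone label."""
    response = runtime.invoke_endpoint(
        EndpointName="tarka-tone-classifier",          # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": [feature_vector]}),
    )
    result = json.loads(response["Body"].read())
    return result["predictions"][0]["label"]           # assumed response shape

# Example: classify_tone([mean_pitch, pitch_range, speaking_rate])
```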
The Gap: How the Product Would Work With the Cloud
After creating a cloud-based algorithm, the physical product would have to include several components in order to communicate with the cloud and work as intended. These components would be (a sketch of how they could work together follows the list):
- Microphone: Used to record the voice samples
- Wi-Fi Module: Used to connect with the cloud and the algorithm for determining tone
- Speaker: Used to play the sound back to the user
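Purely as an illustration of how these three components could work together in one listen-classify-respond cycle, the sketch below assumes a small Python-capable controller (for example a Raspberry Pi) with a USB microphone and speaker. The recording settings, endpoint URL and sound files are placeholder assumptions rather than final design decisions.

```python
# Sketch: one listen-classify-respond cycle on an assumed Raspberry Pi style controller.
# Library choices, the endpoint URL and sound file names are placeholder assumptions.
import requests
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000          # Hz
CLIP_SECONDS = 3             # length of each voice sample
CLOUD_URL = "https://example.com/tarka/classify-tone"   # placeholder endpoint

def listen_and_respond() -> None:
    # 1. Microphone: record a short voice sample.
    recording = sd.rec(int(CLIP_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    sf.write("sample.wav", recording, SAMPLE_RATE)

    # 2. Wi-Fi module: send the sample to the cloud algorithm for a tone label.
    with open("sample.wav", "rb") as audio_file:
        reply = requests.post(CLOUD_URL, files={"audio": audio_file}, timeout=10)
    detected_tone = reply.json().get("tone", "neutral")   # assumed response shape

    # 3. Speaker: play a matching otter noise back to the user
    #    (choose_otter_noise comes from the selection sketch earlier).
    noise_file = choose_otter_noise(detected_tone)
    noise, noise_rate = sf.read(noise_file)
    sd.play(noise, noise_rate)
    sd.wait()
```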
References:
- Rob Matheson (20.01.2016), ‘Watch your tone: Voice-analytics software helps customer-service reps build better rapport with customers,’ MIT News Office: http://news.mit.edu/2016/startup-cogito-voice-analytics-call-centers-ptsd-0120
- Luke Dormehl (18.12.2019), ‘Alexa and Siri can’t understand the tone of your voice, but Oto can,’ Digital Trends: https://www.digitaltrends.com/cool-tech/oto-voice-intonation-ai/
- ‘Sound mind: Detecting depression through voice,’ EurekAlert: https://www.eurekalert.org/pub_releases/2019-07/uoa-sm071219.php
- Daniel Oberhaus (30.04.2020), ‘Moxie is the Robot Pal You Dreamed of as a Kid,’ WIRED: https://www.wired.com/story/moxie-is-the-robot-pal-you-dreamed-of-as-a-kid/