Joan Bajorek Voice AI: The Future is By Everyone and For Everyone

Automatic Summary

Exploring Voice AI: The Future is by Everyone and for Everyone

Hey there, I'm Joan Palmiter Bajorek, the Head of Conversational Research and Strategy at Versa, a leading voice agency. I am also the founder of Women in Voice, an international empowerment organization for women and gender minorities in voice technology. I am thrilled to be here and share my insights on the ever-evolving Voice AI landscape.

The Future of Technology as Seen by Futurists

In an age where we're increasingly relying on technology to simplify our tasks, one quote by Mark Cuban resonates well: "There's no future that doesn't have ambient computing or voice activation". This highlights the integral role of voice technology in setting the stage for the future.

About Me

Based in Seattle, I am a linguist and a researcher, and I completed my Ph.D. at the University of Arizona. As an Alexa Champion, I am committed to fostering the growth of the voice technology ecosystem.

Overview

In this article, I will address key themes such as Voice AI, the ever-growing ubiquity of voice technology, multimodality, and integrated builds for all.

Understanding Voice AI

Voice AI is a form of artificial intelligence (or, as I think of it, augmented information) that takes spoken words as input, interprets them, and can send spoken words back out. To put it simply, it's like a normal conversation; the only difference is that your partner isn't a human but a computer.
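To make this concrete, here is a minimal sketch of that voice-in, voice-out loop in Python, assuming the third-party SpeechRecognition and pyttsx3 packages (an illustration only, not a production assistant); the middle step simply echoes what was heard, standing in for whatever understanding your application would actually do.

    # pip install SpeechRecognition pyttsx3 PyAudio
    import speech_recognition as sr
    import pyttsx3

    recognizer = sr.Recognizer()
    tts = pyttsx3.init()

    with sr.Microphone() as mic:
        recognizer.adjust_for_ambient_noise(mic)  # calibrate to background noise
        print("Say something...")
        audio = recognizer.listen(mic)            # the input is voice

    try:
        heard = recognizer.recognize_google(audio)  # speech-to-text via the free web API
        reply = f"You said: {heard}"                # placeholder for real dialog logic
    except sr.UnknownValueError:
        reply = "Sorry, I didn't catch that."

    tts.say(reply)        # the output is voice too
    tts.runAndWait()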

The Ubiquity of Voice Technology

Google reports that around 20% of its searches are made by voice query, indicating a shift in user behavior. Google and Amazon have been dominating the smart-device (IoT) market, with hundreds of millions of their voice-assisted devices sold and used worldwide.

Moving Towards Multimodality

The future of tech lies in multimodality, i.e., leveraging multiple modes such as text, audio, visuals and video, augmented reality, virtual reality, gesture, and voice as inputs and outputs. To realize this concept, we need hardware powerful enough to process the large amounts of data these diverse applications require.
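As a loose sketch of what leveraging multiple modes can look like in code (the field names below are purely illustrative, not taken from any standard or product), one turn of an interaction becomes a bundle of optional channels rather than a single string of text:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MultimodalTurn:
        """One user or system turn, carried over several channels at once."""
        text: Optional[str] = None        # typed or transcribed words
        audio_path: Optional[str] = None  # speech, music, or other sound
        image_path: Optional[str] = None  # photos, video frames, AR/VR overlays
        gesture: Optional[str] = None     # e.g. a swipe, a point, a sign

    # A spoken request accompanied by a photo, expressed as a single turn:
    turn = MultimodalTurn(text="find me a couch like this one",
                          image_path="living_room.jpg")
    print(turn)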

Tech for All

When we consider tech for all, we must consider the diversity of our population in terms of gender, race, language, and physical ability. Currently, voice recognition technologies show significant accuracy gaps across speakers of different races and genders.

Prospects for Improvement

Improving the inclusivity of these systems is not only a moral obligation; it also represents a significant financial opportunity, since an inclusive system can serve a far wider range of consumers with considerable purchasing power.

Voice Technology in Action

Companies worldwide are creating innovative solutions using voice technology. Some examples include SoapBox Labs, a leading developer of voice tech for kids, and Versa, which works on projects with renowned brands like Domino's, Huggies, and Coca-Cola.

Building a Global Community with Women in Voice

Through Women in Voice, I aim to create an empowering and inclusive community for women and minority genders in voice tech. With chapters globally, we organize tech events, webinars and hackathons, and provide opportunities for networking, collaboration, and professional growth.

Are you interested in diving deeper into the world of Voice AI or becoming part of a thriving global community? Feel free to get in touch with me or explore our resources at womeninvoice.org. Let's create a future where technology is by everyone and for everyone.


Video Transcription

So this talk is called Voice AI: The Future Is by Everyone and for Everyone. My name is Joan Palmiter Bajorek (she/her). I'm the Head of Conversational Research and Strategy at Versa, which is a voice agency, and I'm also the CEO and founder of Women in Voice, and I'll talk about both of those things today. I'm so jazzed to be here, especially during this tumultuous time. It is amazing that the WomenTech Network is here, and getting to meet and talk to people from all around the world brings me a lot of joy, so thank you so much for coming. Let's move into it. Perhaps you know about Mark Cuban; he said, "There's no future that doesn't have ambient computing or voice activation. None." He's a futurist and an investor, and he's thinking about the future of technology and the part voice technology has to play in this whole ecosystem. A little bit about me, since you probably don't know me: like I said, my name is Joan. I'm a linguist and I'm a researcher. I live in Seattle, so greetings from Seattle, wherever you are around the world. I previously worked at Nuance, and I got my PhD at the University of Arizona. A lot of my research has been published in high-profile places like Adobe XD, Cambridge University Press, and Harvard Business Review. I got to speak at CES this year, which was super awesome.

I participate in the Voice Summit, and I'm also an Alexa Champion. All of this is to say that I'm really connected with the voice ecosystem: I'm thinking about, and talking to, the different organizations who are building the future of voice technology and shaping what technology looks like in the voice and conversational ecosystem. As I mentioned before, I'm the CEO and founder of Women in Voice, which is an international empowerment organization for women and gender minorities in voice technology. I'm also the Head of Conversational Research at Versa, which is the largest voice agency in the world.

I'll talk a little bit more about our projects as well. The highlights of today's talk: I'm going to be talking about voice AI, about innovation and the ubiquity of voice technology, and about multimodal and integrated builds for all.

And when we think about tech, we talk a lot about impact, so I'm also going to talk a little bit about the problematic side: who is able to use these tools right now, and who might be able to in the future? OK, what is voice AI? When we think about voice, we think about communication through an oral modality: a conversation, with listening and hearing, between humans. And when we think about artificial intelligence, or augmented information as I think about it, we have the computer both interpreting words and potentially sending words out, so that the input is voice, there is processing on the computing side, and the output is potentially also voice, and you have this interaction.

One of the most sophisticated versions of this to date, for end consumers, is Google AI's Project Duplex; if you're in the voice ecosystem, maybe you know it. Duplex is a system where the Google Assistant can interact with someone over the phone. The assistant asks how it can help, the person says, "Make me a haircut appointment on Tuesday morning, any time between 10 and 12," and Google Duplex will actually call the hair salon on the phone and make that reservation for you, having what we call a multi-turn interaction: there are different pieces of the conversation going back and forth. This has beaten the Turing test, in the sense that people actually think they're speaking to another human. Google Duplex puts in pauses and different ways of iterating the information back and forth, so it's a really cool technology. It scares people, because it sounds so realistic; a voice assistant, a computer, can sound that realistic. And I think there are implications when this technology doesn't report that it's a robot and not a human. What we do say about this is that Google Duplex is a pretty narrow use case: it's very specified, like you're looking to book an appointment or you're looking to go to a restaurant. It cannot support all the conversations of the world quite yet, but it has really pushed the bounds of what voice AI is today.
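To make "multi-turn" a little more concrete, here is a toy sketch of the interaction pattern only (certainly not how Duplex itself is built): the system keeps prompting until every slot it needs, here just a made-up day and time, has been filled.

    def book_appointment() -> str:
        """Toy multi-turn dialog: keep asking until every required slot is filled."""
        slots = {"day": None, "time": None}
        prompts = {"day": "What day works for you? ",
                   "time": "What time would you like? "}
        while not all(slots.values()):
            for name in slots:
                if slots[name] is None:
                    answer = input(prompts[name]).strip()
                    slots[name] = answer or None   # empty answers trigger a re-prompt
        return f"Booked a haircut appointment on {slots['day']} at {slots['time']}."

    print(book_appointment())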

And maybe you haven't heard of Google Duplex, but the ubiquity of voice technology is growing. For example, search by voice: Google reports that around 20% of their searches are made by voice query today, so people are speaking into Chrome or into their mobile devices. I've already seen this myself, and the Google Assistant will replace voice search in Chrome in 2020, so that instead of a search bar you can voice-query material into Google. Similarly, with Google and Amazon being the behemoths in the United States, hundreds of millions of Alexa and other IoT devices have been purchased, and they announced at CES that there are hundreds of millions of weekly interactions with smart devices. So here on the right we have the Google Assistant as well as different Amazon IoT devices; there are so many around the world.

Maybe you've tried them; at one point, one in six people in the United States had one. A lot of people are focused on the ubiquity of these devices, the fact that you can literally ask to search for things, and not only in the future; it's how common this is already. On that ubiquity, I'm often asked, "Voice tech, is that a thing? Do a lot of people interact with that?" In fact, you may already, and not even know that you do. The future is also not only about voice tech specifically. Think about a Google Home product, where you're literally looking at what is basically a black box, a smart speaker that is listening to you. In the future, we will be thinking about multimodal, and this is a common term in my field: multimodal means leveraging multiple modalities, multiple modes, and that can be all kinds of different inputs and outputs with the computer. It could be text, audio, visuals and video, augmented reality, virtual reality, gesture, voice inputs and outputs. Right now you are hearing my voice and, hopefully, seeing the slides visually, and that is interpreted by the computer and shown on your screen.

Here are some other examples, with some Google hardware: smart home devices as well as mobile. On the top right, we have examples of gesture and vision, where by using computer vision, potentially in a medical setting, you could augment images to see them differently, and the computer is interpreting your gestures in these different multimodal environments, where you could also be speaking to the computer. On the bottom right, we have augmented reality, where the device augments what is in your reality. This was famously done with Pokémon Go, right? Augmenting reality.

There have been problems where the hardware is not prepared for the live streaming and the huge amount of data and computing that a lot of these systems expect. But this is the future of our tech, tech that really fits the use case: if it's a furniture-buying adventure, it really makes sense for that; likewise in a medical environment, and so forth. So this is multimodal. Here are some more really cool examples of voice technology in a multimodal world. On the top left here, I believe from Hong Kong, are other kinds of smart devices that came from Alibaba and Baidu, which are big companies out there. They're really thinking about: do you see your media on this? Are you consuming news? How are you interacting with these systems? On the top right is an example from Kuwait, using Alexa: this woman, a technology blogger, uses Alexa for her prayer practices. So people are using these in different ways. At my company, Versa, in Australia, we worked on a collaboration with Domino's where, on the Google Assistant, you can ask Domino's to send you a pizza, and if you're an authenticated user, it will literally send you a pizza. My CEO there has done demos with this where she orders a pizza, and the pizza comes so quickly that people are shocked and think it was a prank.

It was not a prank; it just works that well. It's basically one send, and the pizza gets ordered through Domino's system, so check it out on the Google Assistant. There are also other systems. On the bottom right here, there's a smart device in a home environment that turns to see you, so it's using not only computer vision but also speech, and it's really being thoughtful: is the screen pivoted toward you? Can it hear you well? Is the microphone a 180-degree mic? And so forth. I really think these are some cool examples, from all around the world, of people using devices in different ways. Lastly on this slide, thinking about multimodal and how those inputs and outputs I was talking about can be in different modalities: this is an example of American Sign Language being used with Alexa devices. In this case, Abhishek Singh used computer vision to interpret signs from the user. American Sign Language uses gesture, in that sense, and that is put into the Alexa system, and then voice and screen and back to text, so that someone who communicates through American Sign Language can use the Alexa system, which is a really cool use case.
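The shape of that demo, as described, is roughly camera frames to a sign classifier, to text, to the assistant, and back to a reply shown on screen. Here is a very rough sketch of that pipeline shape; classify_sign and ask_assistant are hypothetical stand-ins for the computer-vision model and the assistant call, not real APIs from the project.

    def classify_sign(frame) -> str:
        # Hypothetical stand-in for a sign-language recognition model.
        return "weather"

    def ask_assistant(utterance: str) -> str:
        # Hypothetical stand-in for sending text to a voice assistant and reading the reply.
        return f"Here's what I found for: {utterance}"

    def signed_query(frames) -> str:
        words = [classify_sign(f) for f in frames]  # gesture -> text
        reply = ask_assistant(" ".join(words))      # text -> assistant
        return reply                                # shown on screen rather than spoken

    print(signed_query(frames=[object()]))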

And also think about, when we're blending these inputs and outputs, what different languages are being used in this multiplicity of environments. This is where tech for all comes in; American Sign Language is one example of something that is a language, but maybe not one we traditionally think of when we think about different languages. In my work, I really think about how tech is or isn't working for lots of different people. So, tech for all. Specifically, my most famous publication, and the one I'm deeply proud of, was published last year in Harvard Business Review. It was so well loved that they put it on the home page and even translated it into Brazilian Portuguese, and this was one of my comprehensive exam papers for my PhD. I'm deeply proud of hopefully helping people better interpret what bias really means in these contexts. The article is titled "Voice Recognition Still Has Significant Race and Gender Biases," and it looks at research presented at NAACL, as well as an updated report from Stanford looking at Black and white speakers, but the idea is that we have significant race and gender biases in what current computers can understand.

One of the big things I talk about in this paper is that it's not intentional. I really don't believe there are people at these companies saying, "Oh, let's design something that's horribly biased against people." No; in fact, there are huge incentives, economically, for people to be making the best systems possible. [Brief interruption as another attendee, Shanie, joins the session and is mistaken for the next speaker or the moderator; she is just joining.]

Lovely. OK, well, I am just continuing to talk about gender and dialect bias. This is a very famous paper from 2017 that looks at accuracy. I don't know if you play with ggplot2 (that's one of my coding languages), but that's what this figure was made in, and you can see different accents from California, New Zealand, New England, Georgia, and Scotland. So these are different people speaking these dialects, and across them the systems are about 13% more accurate for men than for women, and for mixed-race speakers accuracy is potentially another 10% lower. This is the 50% line; basically, these systems are not performing that well, is the answer to most of this. And since that is not exactly interpretable for everyone, I made a scorecard that's potentially a little more interpretable. For example, say three people are all native speakers of English, the only real difference is who they are as speakers, and they're reading the exact same paragraph. My friend Josh, who is a white male, might get an A-minus on how well the system understands him: it understands almost everything, and he only needs to correct 8% of what's going on. If I speak to the same system with the same paragraph, as a white female, I might get a C-plus, a 79, for how well it understands me. And even worse than this, Jada, who is mixed race, might get a D-plus for how well the system understands her. I talk about this type of research at tech events and at dinner parties as well, and people say, "Oh, this explains it! It always understands Gary and never understands me."
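To put numbers like "correct 8% of what's going on" on a common footing, speech systems are usually scored by word error rate: the substitutions, insertions, and deletions needed to turn the system's transcript into the reference, divided by the length of the reference. A small sketch, with two invented example sentences:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level Levenshtein distance divided by the reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits needed to turn the first i reference words
        # into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    reference = "please add milk and eggs to my shopping list"
    hypothesis = "please add silk and eggs to my shopping list"
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # 1 error in 9 words, about 11%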

And one of the things I struggle with terribly is that people believe it's their own fault that the tech doesn't understand them, and that's utterly not the case. Most of these systems are just deeply biased, unfortunately, and we are working on that in the speech ecosystem. And when we think about why to make a system better, not only is there an obvious moral obligation, but there's a huge amount of money in these kinds of IoT devices performing well for everyone. For example, one of the examples I think about often is my coworker, who speaks Spanish and English; he and his wife have a significant amount of purchasing power. They considered buying a smart fridge (I don't know if you realize exactly how much smart fridges cost; they cost a lot of money), but he decided not to buy one, because he knew he and his wife would not be understood by the smart fridge and the voice recognition in it.

And this matters: in the United States alone, the Latinx community has $1.7 trillion in purchasing power, so there's a huge amount of money available to purchase the different IoT devices they want. There's also a huge growth index: the Latinx community accounts for almost half of the growth of the US population. So really think about who you are catering to. Is the product prepared for them? Will they be buying it? Are there factors preventing that? This is also spoken about by Melinda Gates: huge missed financial opportunities, financial blind spots. I know I'm preaching to the choir (you are all at the WomenTech Global Conference), but I think it's really important when we think about incentives for other people who might not come from our perspective, or who really need the case made for why to consider these opportunities and financial blind spots. She says we care about diversity, but we really care about how much money we make: women are 85% of consumer dollars spent; women control 70% of the financial decisions in the house; so you're missing an opportunity because you don't see it, you're leaving money on the table, you're not in the deal flow. I think this is extremely compelling, especially coming from someone as successful as Melinda, and articulating it in that way can be extremely powerful.

We started at 20 past and it ends at 40 past, so I have three minutes left. I have some examples of different cool startups in the ecosystem. Working on voice technology for kids: check out SoapBox Labs, which just raised 6.5 million. Novel Effect, especially if you have kids: this company plays soundscapes behind your reading of a book, which is really cool. They're based in Seattle; I know the founders, and they're phenomenal. There's also voice technology for an elderly population: are we designing tech for the 50-and-over cohort, or whatever you think of as elderly, given that clearly some people live much, much longer? Really think about tech that is designed for an older population, what their needs might be, and how voice tech, or technology in general, is prepared to support them. Lastly among these examples, Common Voice is a project out of Mozilla, the Mozilla Firefox company you may have heard of. Mozilla is working on projects that teach machines how real people actually speak, instead of people learning how to speak so that, for example, Alexa understands them. They make voice data freely and publicly available, open source; you can donate your voice, and they really think about what demographics are compiled in these data sets. So I've donated my voice; check it out, Common Voice is the name of the project, through Mozilla.
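If you're curious what those demographics look like, each Common Voice release ships its clip metadata as tab-separated files; the exact file and column names vary by release, so treat the ones below (validated.tsv with gender, age, and accent columns) as assumptions to check against your own download. A quick look with pandas:

    import pandas as pd

    # Metadata file from a Common Voice download; name and columns vary by release.
    clips = pd.read_csv("validated.tsv", sep="\t")

    for column in ("gender", "age", "accent"):   # assumed column names
        if column in clips.columns:
            print(clips[column].value_counts(dropna=False), end="\n\n")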

Lastly, I of course also have to talk about my amazing company, Versa, based in Australia. We've done projects with Huggies, with Domino's, and with Beyond Blue, which is around mental health; we've worked on maternal and child health care apps and conversational bots; and we've done some promotions with Coca-Cola. So if you're interested in hearing more about conversational AI, about voice technology, about how to support your brand, or you're just interested in this, definitely get in touch: at Versa we do some amazing work. And finally, as the CEO of Women in Voice, I have to tell you about my amazing community. We have 12, I think almost 13, chapters around the world, all across the globe. Here's an example from a Mexican hackathon, here are some of my Japanese ambassadors, and here is a tech event in Madrid at Google for Startups. There's a lot of cool work we're doing. If you have an interest in this, or if you're pivoting into voice: we're an international empowerment organization for women and minority genders in voice tech.

We're all about building community, amplifying, providing professional development, and empowering and celebrating diverse people in this field. So check it out. Our website is womeninvoice.org (just search for "women in voice"), and we're really active on Twitter and LinkedIn, as well as Slack and some other forums. So this has been Voice AI: The Future Is by Everyone and for Everyone. I hope you have enjoyed it. I look forward to being in touch again; you can find me and all of this work on social media. I've put my details in the chat, and I'll do that again before I leave. It was lovely speaking with all of you, and I look forward to continuing the discussion.