Learning to speak machine

Persistently connected and pervasive devices form an increasing, and arguably ever more essential, part of our future.

They manage our homes. They help us quantify our health: our pulse rate, our activity levels, our calorie intake. They keep us connected to news, social comment, the weather, the train timetable, our working lives, and our communication with our friends and family: our tribe, our social selves.

Image: Amazon Echo Dot, courtesy of Amazon

The Internet of Things

This hyper-connected world, with up to 120 billion devices forecast to be operating within the Internet of Things in the next 10 years, has been born from the alignment of multiple fields.

Technological advancements in voice recognition, immersive audio technology, artificial intelligence and haptics are resulting in novel and increasingly intuitive device interactivity.

Big data has enabled voice recognition: the huge volume of recorded speech and text now available is used to refine and improve both the acoustic model and the language model. Research firm MarketsandMarkets predicted in 2016 that the speech recognition market would reach $10 billion by 2022.
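By way of a rough sketch (not any particular engine's implementation, and with invented phrases and probabilities), a recogniser weighs an acoustic score, how well the audio fits a candidate transcription, against a language-model score, how plausible that transcription is as something a person would actually say. It is the language model, trained on vast quantities of text, that picks the sensible hypothesis:

```python
import math

# Toy illustration of combining acoustic and language-model evidence for
# candidate transcriptions. The candidates and probabilities are invented.

# P(audio | words): how well the audio matches each candidate (acoustic model)
acoustic_prob = {
    "turn on the lights": 0.020,
    "turn on the lice": 0.025,   # acoustically similar, marginally better fit
}

# P(words): how plausible each candidate is as a sentence (language model)
language_prob = {
    "turn on the lights": 0.0010,
    "turn on the lice": 0.0000001,  # big data says nobody asks for this
}

LM_WEIGHT = 1.0  # how heavily to weight the language model

def score(candidate: str) -> float:
    """Combined log score: log P(audio|words) + weight * log P(words)."""
    return math.log(acoustic_prob[candidate]) + LM_WEIGHT * math.log(language_prob[candidate])

best = max(acoustic_prob, key=score)
print(best)  # "turn on the lights": the language model rescues the acoustics
```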

Object-based audio

Immersive sound takes the traditional multi-channel surround experience to a whole new level by using height or presence channels (speakers mounted high on walls or a ceiling) to create a dome of sound.

The resulting “3D” audio effect ushers the listener that much closer to artificial sound that mirrors real-world auditory experience. While traditional channel-based audio can be used to create immersive sound, a new type of audio encoding called “object-based audio” is proving to be highly effective.

Object-based audio allows Hollywood moviemakers to program an audio mix that pinpoints sounds to specific areas of space within a room. Properly equipped AV receivers (AVRs) can decode a film’s object-based encode and use the available speakers to best replicate sound placement as the filmmaker intended. It means that when a gun fires, or a character laughs, it feels like they’re right there in the room with you. In business, immersive sound can combat open-plan office issues, making it much easier to communicate and collaborate with those who aren’t in the office.
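As a simplified sketch of the principle (not the actual encoding or rendering used by any commercial format, and with an invented speaker layout and panning rule), each object carries positional metadata, and the renderer decides at playback time how to spread it across whatever speakers happen to be available:

```python
from dataclasses import dataclass
import math

@dataclass
class AudioObject:
    """A sound plus the position the mixer wants it to come from."""
    name: str
    position: tuple  # (x, y, z) in room coordinates, metres

@dataclass
class Speaker:
    name: str
    position: tuple

# Hypothetical playback layout: the renderer only learns this at decode time.
speakers = [
    Speaker("front-left",  (-2.0, 3.0, 1.0)),
    Speaker("front-right", ( 2.0, 3.0, 1.0)),
    Speaker("ceiling",     ( 0.0, 0.0, 2.5)),
]

def render_gains(obj: AudioObject, layout: list) -> dict:
    """Distribute an object across speakers, weighting the nearest ones most.
    Real renderers use far more sophisticated panning laws; this simply shows
    that placement is resolved at playback time, not baked into channels."""
    weights = {}
    for spk in layout:
        dist = math.dist(obj.position, spk.position)
        weights[spk.name] = 1.0 / (dist + 0.1)   # closer speaker, larger gain
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

gunshot = AudioObject("gunshot", position=(1.5, 2.5, 2.0))  # up and to the right
print(render_gains(gunshot, speakers))
```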

Future interfaces will deliver content in more meaningful, relevant ways. Content can stimulate all of our senses and be delivered at contextually relevant times.

Audio is expanding not only the internet, but also our relationship with the content itself.

The value of services delivered via pervasive and connected devices will be immense, and has the potential to directly affect both developed and developing countries. In their 2015 report “The Internet of Things: Mapping the Value Beyond the Hype”, consultancy firm McKinsey forecast the global economic value of IoT services to be between $4 trillion and $11 trillion by 2025.

The sectors most likely to be directly impacted by pervasive personal devices – Human, Health, Vehicles and Offices – will together deliver between $550 billion and $3 trillion in value over the same period.

Overall, McKinsey’s analysis suggests that over 60% of the value derived from IoT services will be delivered to the developed world. Despite higher numbers of potential deployments in developing countries, the economic benefit, whether in consumer surplus, customer value gained from efficiency savings or technology spend, will disproportionately be delivered to developed economies. The disparity is even starker in the categories where pervasive and connected devices are likely to be the medium of delivery or the interface to IoT services. In the Human category – which covers monitoring and managing illness and wellness – almost 90% of the value derived will be in developed nations, partly a reflection of the fact that health-care spending in developed economies is twice that of developing economies.

But does the attractiveness and seductiveness of our apparent ability to engage with these pervasive devices have a more challenging side?

What are the privacy implications of pervasive listening devices? What are the ethical considerations we should be levelling at the providers of voice-recognition and AI platforms? If a home assistant “hears” a violent struggle, should the police be called? What if it “hears” a fall or the sounds of someone in pain – should it alert medical providers? Do we “speak” differently to listening devices – and if we do, does that new mode of speaking transfer into other communication spheres? Does learning to speak machine make us speak differently to each other?

And are we confident that, even if we develop the ethical, procedural and privacy controls that let us feel comfortable with pervasive listening devices engaging in communication with us, the human side of the equation will prove as rule-abiding as the machines? Or that we will even continue to understand the exchanges we are making, or that the machines are making with each other? Microsoft, Google and Facebook have all been experimenting with AI-generated communication and voice interaction – with results ranging from the obscene, hate-crime-referencing abuse taught to Microsoft’s AI avatar Tay by exposure to the Twittersphere, to the more benign but ultimately endless-loop ramblings of Vladimir and Estragon, the Google Home chatbots (not the Beckett characters), and the development by Facebook’s negotiation AIs of their own language, evolved from written English but ultimately indecipherable to their programmers.

A voice-user-interface (VUI) to an AI-enabled processing platform represents an unprecedented degree of abstraction of programming commands. It doesn’t even feel like a computer you are engaging with. That’s part of what makes VUI-AI so extraordinarily accessible to so many, even those who might otherwise have struggled to engage with technology. It is a sure-fire way to bridge the digital divide, but it could also leave us with a massively inexperienced consumer base using highly advanced technology, with an ever-reducing requirement to learn machine-level programming languages at all.

Speaking to each other

Our ability to use language, meaning and nuance in the negotiation of infinitely subtle social contexts is one of the reasons that “speaking machine” appears so superficially attractive but quickly proves so dissatisfying.

It’s the reason why so many smart-speaker devices, despite their significant processing capability and an ever-burgeoning set of “skill” extensions, are used as expensive and comparatively mediocre music players. Computers don’t understand language. When we learn to “speak machine” we adopt verbal cues within fixed or pre-ordained patterns of command selection. We have become our own verbalised command line.
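A minimal sketch of that fixed-pattern listening (the phrases and intents here are invented, not any vendor’s actual skill grammar) shows the device matching utterances against pre-ordained templates rather than understanding language:

```python
import re

# Hypothetical command grammar for a smart-speaker "skill". Anything the
# user says is matched against these fixed patterns; phrasing that falls
# outside them simply fails, however natural it sounds to a human.
COMMAND_PATTERNS = [
    (re.compile(r"^play (?P<artist>.+)$", re.I),                        "play_music"),
    (re.compile(r"^set a timer for (?P<minutes>\d+) minutes?$", re.I),  "set_timer"),
    (re.compile(r"^what'?s the weather( today)?$", re.I),               "get_weather"),
]

def parse(utterance: str):
    """Return (intent, slots) if the utterance fits a template, else None."""
    for pattern, intent in COMMAND_PATTERNS:
        match = pattern.match(utterance.strip())
        if match:
            return intent, match.groupdict()
    return None  # the human must rephrase to suit the machine

print(parse("set a timer for 10 minutes"))     # ('set_timer', {'minutes': '10'})
print(parse("could you maybe time my eggs?"))  # None: too human
```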

Computers are extraordinarily capable of processing massive volumes of data and performing almost unimaginably complex calculations within specific parameters. Humans are extraordinarily capable of the adaptive processing and synthesis of entirely novel sources of information in alien contexts.

We will all get better at “speaking machine”, and they will get much, much better at listening to us. But we shouldn’t let the beguiling responsiveness of technology limit our passion and enthusiasm for speaking to each other – or allow the apparent shrinking of communication distances to hold back our natural instinct to search out and experience the diversity and variety of our world.