Personal Digital Assistants - The Future of Ubiquitous AI

by Marc Ruef
time to read: 18 minutes


  • Personal assistants are software solutions that are designed to take on concrete tasks
  • Through voice commands you can send messages or have appointments read out, for example
  • Recognition of voice commands generally occurs locally, with data processing in the cloud
  • For products to achieve wide reach, voice input must be natural and simple
  • Within a few years we can expect the kind of acceptance and distribution that smartphones presently enjoy

Digital personal assistants or virtual assistants are solutions that are intended to offer the advantages of an assistant in day-to-day life. Various products are competing for the favor of users. This article looks at the possibilities and future developments expected in an area that will exert a decisive influence on both the technological and sociological level.


Personal assistants are sold as dedicated software. You may have previously encountered them on your smartphone in the form of Siri or Google Assistant, successor to Google Now. After Amazon presented its own Alexa service as autonomous hardware in the form of Echo and Dot, Google and Apple have now stepped up with Home and HomePod respectively. Ultimately, what’s at stake here is the dominance of the living room: The more naturally an assistant can be integrated into day-to-day life, the greater its acceptance and chance of survival. Microsoft has now integrated Cortana into the Windows operating system. Apple did so with Siri in macOS, and sooner or later other manufacturers will be keen to follow suit. It is now common for apps to be positioned on competitors’ platforms, even though restricted integration into the external ecosystem means they are only able to offer a fraction of their normal options.

These solutions are primarily based on audio communications: They wait until they hear a keyword, and they are then ready to accept commands through voice entry. The results are preferably returned in audio form, but sometimes just as a list of web search results. Through commands, users can make calls, dictate messages, manage calendar entries, listen to music and play games, for example.

Both Siri and Allo (an alternative connection to Google Now) also allow text input on the device. In terms of architecture, this means that you can separate out the input and output functions of data processing and recombine them however you wish. Command input via OCR is just as feasible as image recognition (e.g. as a basis for sign language). This allows people with disabilities to make accessible use of the latest electronics.

| Product          | Manufacturer | Main platforms          | Strengths                                   |
|------------------|--------------|-------------------------|---------------------------------------------|
| Alexa            | Amazon       | Echo, Dot, Kindle Fire  | purchase, additional Skills                 |
| Cortana          | Microsoft    | Windows Phone, Windows  | general, Windows integration                |
| Google Assistant | Google       | Android, Home           | knowledge, coverage, precision              |
| Mycroft          | GPL          | Linux                   | open source, Linux integration, Raspberry Pi |
| Siri             | Apple        | iOS, macOS              | Apple ecosystem integration                 |

This is where we encounter the first security concerns in relation to privacy. For your assistant to hear your keyword, it must constantly listen in on your conversations. In Europe especially, this has met with persistent skepticism. And not just because Alexa was called as a witness to a murder. To lay these concerns to rest, many products offer physical activation for voice input. The iPhone, for instance, requires the home button to be held down. Amazon Echo even supports complete button-based deactivation of the microphone. Experience shows that speaking Swiss German dialect is a good way to avoid being systematically and automatically bugged, for the time being at least.

Manufacturers promise that they are not collecting data, even when continual listening is activated. Nor would this be desirable from an ergonomic point of view. Amazon Echo and Apple Siri, for instance, record for a maximum of three seconds in order to recognize the keyword. This is evaluated by a local chip in order to ensure minimal latency. Only after it receives the keyword does the device send the voice input to the cloud for comprehensive analysis, evaluation, and processing.
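The two-stage design described above can be sketched as a small state machine: a short rolling buffer is evaluated locally, and audio is only handed to the cloud once the keyword has been heard. This is a minimal illustration, not any vendor's actual implementation; the class name, buffer sizes, and the `transcript_hint` parameter (standing in for the on-chip recognizer) are all assumptions.

```python
from collections import deque

class WakeWordGate:
    """Sketch of the local keyword stage: keep only a short rolling
    buffer of audio and hand data to the cloud only after the wake
    word has been recognized. Sizes and names are illustrative."""

    def __init__(self, keyword="alexa", buffer_seconds=3, chunks_per_second=10):
        self.keyword = keyword
        # rolling buffer holds at most ~3 seconds of audio chunks;
        # older chunks are discarded automatically
        self.buffer = deque(maxlen=buffer_seconds * chunks_per_second)
        self.listening = False

    def feed(self, chunk, transcript_hint=""):
        """chunk: raw audio bytes; transcript_hint stands in for the
        output of the local recognizer chip, which is not implemented here."""
        self.buffer.append(chunk)
        if not self.listening and self.keyword in transcript_hint.lower():
            self.listening = True  # wake word heard: open the cloud channel
        return "send_to_cloud" if self.listening else "discard_locally"
```

Before the keyword appears, everything stays in the short local buffer and is discarded; afterwards, input is forwarded for cloud analysis.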

On the other hand, it is in the providers’ interests to collect as much audio data as possible. It is also conceivable that manufacturers would wish to retain audio recordings for longer periods. This would allow them, for example, to passively improve speech recognition, which would, in turn, be of benefit to the user. But it would also assist them in identifying words, speakers and the ambient sounds around them. Because all of this data has value, not least to the advertising industry, which will want to make use of it. There has recently been a discussion on the Internet about people receiving spontaneous personalized advertisements for products following personal conversations about related issues. This would, of course, be an unacceptable infringement of privacy. It would be technologically feasible, commercially it might even be desirable. And as with so many infringements of privacy, it is only a matter of time until this one, too, overcomes widespread opposition.

Restrictions and breakthrough

Personal assistants are linked to artificial intelligence (AI). It is AI that is tasked with providing a correspondingly intelligent response to the input. It is therefore important that the input processing is as smart as possible. If not, users will perceive the device as dumb and reject it.

When it comes to the quality and quantity of data, Google naturally has a competitive edge. It will be impossible for providers like Amazon to close this gap in the near future. Instead, providers may well decide to purchase additional data inventories from their competitors (IBM Watson, for example, is a very interesting option), or they may obtain them directly from the market leader. Amazon has found a way around this hurdle by primarily configuring its products to facilitate straightforward purchases from its online store.

The way we evaluate the intelligence of an assistant is subjective and highly dependent on the command input. So the quality of speech recognition is enormously important if the device is to succeed. This begins with the audio quality offered by the microphone. Here Amazon Echo has put in some solid prep work by fitting its devices with multiple microphones which ensure optimal speech recognition. This is a purely hardware-related issue. And Amazon is also setting the standard in relation to response time (recognition of the beginning and end of speech input).
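Recognizing the end of speech input, mentioned above, is classically done by watching the signal energy for a sustained run of silence. The following toy endpointing function illustrates the idea; the frame energies, threshold, and silence length are invented values, not any product's tuning.

```python
def detect_end_of_speech(energies, silence_threshold=0.05, silence_frames=5):
    """Toy endpointing: return the index of the first frame of a run of
    silence_frames consecutive frames below silence_threshold, i.e. the
    point where the utterance is considered finished. Returns None if
    speech never ends within the given frames."""
    run = 0
    for i, energy in enumerate(energies):
        if energy < silence_threshold:
            run += 1
            if run == silence_frames:
                # first frame of the silence run marks the end of speech
                return i - silence_frames + 1
        else:
            run = 0  # speech resumed, reset the silence counter
    return None
```

A real device would run this continuously on microphone frames; the quality of the chosen threshold directly affects the perceived response time.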

But even more important is the speech analysis itself, that is the ability of the system to recognize individual words and transform them into comprehensible instructions. Unlike microphones, this is a software issue. The software in question is not currently installed on the end device. Instead, speech data is always sent to the cloud and evaluated there, where there are plenty of resources to hand. This cloud-based approach also allows manufacturers to implement optimization without forcing users to purchase new products or install patches.

Ultimately, personal assistants will only become a part of day-to-day life when they can be used naturally. Users over 30 often find it difficult to talk to a computer at first, despite the last 20 years of smartphones and voicemails. For the Swiss, there is the additional problem that they have to issue instructions in a foreign language. Even if it is only standard German. But still, it feels strange to have to suddenly formulate individual phrases in a foreign language. Naturally, this is likely to present less of a barrier to English-speaking users.

But the main problem afflicting present solutions is the unnatural handling that is sometimes expected from the user. German speakers, for example, might expect Alexa to play songs in a random order on hearing the phrase Starte Zufallswiedergabe (“begin random play”). But Alexa doesn’t recognize this instruction. Instead, the user has to use the English equivalent, Shuffle, which ironically is acknowledged with the words Zufallswiedergabe aktiviert (“random play activated”).

This is most likely just a failure to make speech commands sufficiently German-friendly. But if you wanted to start a playlist entitled Best of Rock, for instance, then it’s not enough to formulate this as something like: Play Best of Rock. Instead you have to explicitly say: Play playlist Best of Rock.

You can find plenty of further examples for every platform. Collectively, they show that it is still the user who is expected to conform to the peculiarities of the device. And when you have to first consider how to formulate an instruction, the interaction becomes clumsy and unnatural.

In the case of the Alexa Skills, this is generally quite deliberate. The Skills, comparable to apps, can be loaded onto an Echo or a Dot. The number of Skills has grown from 135 in Q1 2016 to 10,000 in Q1 2017. Various news portals, music services, and websites offer their own Skills. For example, you can have your device open Wikipedia and then look something up. The major restriction here is that these Skills must be specially summoned. So the instruction would be: Alexa, open Wikipedia and search for Cybersecurity. This example may seem harmless. But things get more difficult the more Skills you install and the less often you use them. Here, too, you are forced to stop and think about the right way to formulate the question.

Amazon would do well to implement a keyword for controlling Skills. But they would have to be summarized into product categories and prioritized, either by Amazon or by individual developers. So when you say Play rock music, Alexa would check Spotify as well as all the other Skills in the Music/Radio category for possible hits.
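The category-based routing proposed above could look roughly like this: a registry maps each Skill to a product category, and a command is offered to every Skill in the matching category rather than requiring an explicit "open <skill>" invocation. The registry contents, skill names, and utterance lists here are invented for illustration.

```python
# Hypothetical Skill registry, grouped by product category as the
# proposed routing scheme would require. All entries are invented.
SKILLS = {
    "Spotify":   {"category": "music",     "utterances": ["play rock music", "shuffle"]},
    "WebRadio":  {"category": "music",     "utterances": ["play jazz radio"]},
    "Wikipedia": {"category": "knowledge", "utterances": ["search for cybersecurity"]},
}

def route(command, category):
    """Return every Skill in the given category that claims the command,
    so a phrase like 'Play rock music' no longer needs an explicit
    'open <skill>' prefix."""
    hits = []
    for name, skill in SKILLS.items():
        if skill["category"] == category and command.lower() in skill["utterances"]:
            hits.append(name)
    return sorted(hits)
```

On the command Play rock music in the music category, such a router would check Spotify alongside every other registered music Skill, exactly as the paragraph above suggests.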

Concrete visions for the future

Today’s personal assistants are primarily a combination of microphone and speaker which allows interaction. With the Dot, Amazon has attempted to provide a competitively priced device to achieve coverage in every room in the house. But it is important that individual devices (Echo or Dot) function autonomously and do not interfere with each other. The device that recognizes the speech command most readily will respond. There is still no true device synchronization, nothing like the kind of multi-room system you find with Sonos or Bose. But this would be the next obvious step.

One fundamental problem for present-day solutions is security. There is no true authentication. While Siri can be configured so that the user has to first log on with a smartphone in order to issue commands, this is an undesirable hurdle which significantly diminishes the user experience. With Alexa, it is possible to set up multiple accounts and switch between them, but this does not require a password. As such, third parties may be able to access and manipulate them on the spot. Many households are therefore adopting a generic user account with no links to sensitive personal content. But this, in turn, diminishes the functionality. Ideally, we will have user identification through voice recognition in the future. Pitch, tone and speech patterns could make it far more difficult for third parties to issue instructions.

Note that, despite the discussion over their intelligence, these systems are still computers. They are networked and access cloud-based services. Thus, the protection of privacy is as important as the hardening and patching of components. By deactivating individual functions, data leakage can be curtailed in some cases, but due to the architecture of modern solutions, it cannot be prevented completely. However, this architecture at least has the advantage of allowing automated patching of the components, assuming that service vendors follow this practice in a sustainable fashion.

The functionality of solutions – what we refer to as their intelligence – will increase significantly. Speech recognition is improving all the time and more and more data can be linked up. On this important point, it is the product with the best and fastest growing possibilities that will win the competition. A particularly open solution may offer a decisive competitive advantage. With its Alexa Skills Kit SDK, Amazon is already well advanced, allowing the creation of tailored Skills. It is likely that Google, too, will pursue a more open strategy in order to catch up with Echo and Siri. Complete open-source approaches, such as the Mycroft project, will probably be the exception.

But overall, systems will have to be capable of learning. And not through programming, but in dynamic exchange with the user. So after repeatedly asking play Best of Rock, if an album of this name is played and the instruction is corrected to Play playlist Best of Rock, then after the second or third time, the system should assume that the user has the playlist in mind and not the album.
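The playlist-versus-album scenario above amounts to remembering repeated corrections and preferring the corrected interpretation once it recurs. A minimal sketch of that heuristic follows; the class, the threshold of two corrections, and the string-based commands are all assumptions made for illustration.

```python
from collections import Counter

class CorrectionLearner:
    """After the user repeatedly corrects one interpretation of a command
    to another, prefer the corrected reading. The threshold of two
    observed corrections is an arbitrary illustrative choice."""

    def __init__(self, threshold=2):
        self.corrections = Counter()  # (heard, meant) -> count
        self.threshold = threshold

    def record_correction(self, heard, meant):
        """Called whenever the user rephrases 'heard' as 'meant'."""
        self.corrections[(heard, meant)] += 1

    def interpret(self, heard, default):
        """Return the most frequently learned correction for 'heard'
        once it has passed the threshold, otherwise the default."""
        candidates = [(count, meant)
                      for (h, meant), count in self.corrections.items()
                      if h == heard and count >= self.threshold]
        if candidates:
            return max(candidates)[1]  # highest-count correction wins
        return default
```

After the second recorded correction, the system would resolve play Best of Rock to the playlist rather than the album, without any explicit programming.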

The key to the kind of widespread success enjoyed by the Internet and the smartphone lies in natural interaction. As soon as you can take advantage of the possibilities easily and at each and every step of the way – just like the computer in Star Trek – the use of personal assistants will become a staple of day-to-day life. But that could be five to ten years away. In the interim, these solutions are of undeniable benefit to people with disabilities (dyslexia, visual impairment, paraplegia).

About the Author

Marc Ruef

Marc Ruef has been working in information security since the late 1990s. He is well-known for his many publications and books. His latest, The Art of Penetration Testing, discusses security testing in detail. He lectures at several institutions, including ETH, HWZ, HSLU and IKF. (ORCID 0000-0002-1328-6357)

