Marc Ruef
There is a whole range of technologies that are considered capable of disruption, on both technical and social levels. Among other things, it is becoming apparent that artificial intelligence (AI) will become a central part of our lives in the future. This article discusses the mechanisms that need to be established in order for AI to be perceived as intelligent and human.
In the age of personal assistants such as Amazon Echo and Apple’s Siri, the ability of AI to understand speech is becoming particularly important. Whether a user considers a personal assistant to be truly intelligent depends greatly on its ability to correctly understand the user’s instructions. The next step, i.e. the correct processing of the user’s request, is only possible if the initial instructions have been understood.
To achieve this goal, it is first necessary to implement a speech recognition system. This system must be capable of understanding the meaning contained within the user’s commands. The simplest method (and the one preferred by many AI programs) is based purely on patterns. If a statement contains the expression ‘timer’, a timer will be activated. Ideally, the command should be ‘start the timer’.
Yet not all users are able or willing to formulate simple, clear sentences. A command may be expressed in a roundabout way, such as ‘please start the, um, yes, right, the timer’. This is not a problem, however, because the pattern recognition approach means the program will simply focus on the word ‘timer’. In most cases, such commands will produce the desired result. Amazon’s Alexa is a very good example of this, which owes much to the principle of utterances with slots.
If a request contains not only the word ‘timer’ but also the word ‘how’, then instead of starting the timer, Alexa will announce how much time is remaining on it. Aside from the typical timer ⇒ new scenario, there is also the typical timer ∧ how ⇒ remaining scenario. It is the task of the AI’s database administrator to define the largest possible collection of patterns and actions.
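The two scenarios above can be sketched as a small keyword rule table. This is only an illustration of the pattern idea, not how Alexa is actually implemented; the names `RULES` and `match_action` are assumptions made for this example.

```python
from typing import Optional

# Hypothetical rule table: a set of required keywords maps to an action.
RULES = [
    ({"timer"}, "new"),               # timer => new
    ({"timer", "how"}, "remaining"),  # timer AND how => remaining
]

def match_action(utterance: str) -> Optional[str]:
    """Return the action of the most specific rule whose keywords all occur."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    # Check rules with more keywords first so that the specific rule wins.
    for keywords, action in sorted(RULES, key=lambda rule: -len(rule[0])):
        if keywords <= words:
            return action
    return None
```

Filler words are simply ignored, so `match_action("please start the, um, yes, right, the timer")` still resolves to the `new` action.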
In these examples, the AI’s ability to understand speech is minimal. The AI is based on stimuli and does not understand the real linguistic context. The latter is desirable, however, as it is the only way to improve the input quality and make it possible to introduce additional components (e.g. spontaneity, the ability to learn, feelings).
To achieve a real understanding of speech, it is necessary for each command to be linguistically dissected. At a minimum, it is now the task of natural language processing (NLP) to identify which part of the instruction is the subject and which is the verb. After all, there is a difference between saying ‘you are hungry’ and ‘I am hungry’.
A database for the individual types of words can be used to identify them within the context of NLP. This requires the compilation of tables for the different kinds of words (nouns, verbs, adjectives, etc.). The respective columns must document all the different morphological forms. In German, verbs need to be recorded by person, number and tense, while nouns and adjectives are inflected for gender, case and number. On the one hand, this is the only way to facilitate a pattern-based recognition of the individual words. On the other hand, it provides the foundation for the AI’s ability to understand the context, react to it and formulate grammatically correct answers. The following two tables show the active indicative conjugation of the German verbs sein (to be) and haben (to have). For the compound tenses, only the infinitive or participle that accompanies the auxiliary verb is listed.
Sein | Singular 1 | Singular 2 | Singular 3 | Plural 1 | Plural 2 | Plural 3 |
---|---|---|---|---|---|---|
Present | bin | bist | ist | sind | seid | sind |
Simple past | war | warst | war | waren | wart | waren |
Present perfect | gewesen | gewesen | gewesen | gewesen | gewesen | gewesen |
Past perfect | gewesen | gewesen | gewesen | gewesen | gewesen | gewesen |
Future I | sein | sein | sein | sein | sein | sein |
Future II | gewesen | gewesen | gewesen | gewesen | gewesen | gewesen |

Haben | Singular 1 | Singular 2 | Singular 3 | Plural 1 | Plural 2 | Plural 3 |
---|---|---|---|---|---|---|
Present | habe | hast | hat | haben | habt | haben |
Simple past | hatte | hattest | hatte | hatten | hattet | hatten |
Present perfect | gehabt | gehabt | gehabt | gehabt | gehabt | gehabt |
Past perfect | gehabt | gehabt | gehabt | gehabt | gehabt | gehabt |
Future I | haben | haben | haben | haben | haben | haben |
Future II | gehabt | gehabt | gehabt | gehabt | gehabt | gehabt |
The algorithm for grammatical speech recognition now identifies every single word in a command. As soon as a match is found in the database, it is noted as such. This is relatively easy for simple sentences like ‘I am hungry’. Broken down into the different types of words, this is {pronoun} {verb} {adjective}. This sentence can be recognized by a direct and simple pattern recognition program.
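A minimal sketch of this lookup step, assuming a tiny in-memory dictionary in place of the word-type database. The name `LEXICON`, its entries and the choice to label ‘I’ as a pronoun are assumptions of this example.

```python
# Hypothetical stand-in for the word-type tables in the database.
LEXICON = {
    "i": "pronoun",
    "you": "pronoun",
    "am": "verb",
    "are": "verb",
    "hungry": "adjective",
}

def tag(sentence):
    """Look up the word type of every word; unknown words are marked as such."""
    words = sentence.rstrip(".!?").split()
    return [LEXICON.get(word.lower(), "unknown") for word in words]

def to_pattern(sentence):
    """Render the word types in the {type} notation used in the text."""
    return " ".join("{%s}" % word_type for word_type in tag(sentence))
```

With this lexicon, `to_pattern("I am hungry")` and `to_pattern("You are hungry")` produce the same word-type pattern, which is what a simple pattern recognition program would then match against.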
The grammatical dissection makes it possible to detect and change the tense, for instance. This is important, because it enables AI to display seemingly natural behavior during a dialog with a user. We published a proof of concept (PoC) for this on GitHub. Among other things, our PoC is capable of recognizing word forms and their properties, so that it can then apply other attributes (e.g. turning an ‘I’ present tense sentence into a ‘you’ past tense sentence).
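The following sketch illustrates the idea with the present and simple past rows of the sein table. It is not the published PoC; the names `SEIN`, `PERSONS`, `analyse` and `inflect` are assumptions made for this example.

```python
# Two rows of the sein table from above, indexed by tense.
SEIN = {
    "present": ["bin", "bist", "ist", "sind", "seid", "sind"],
    "simple past": ["war", "warst", "war", "waren", "wart", "waren"],
}
PERSONS = ["singular 1", "singular 2", "singular 3",
           "plural 1", "plural 2", "plural 3"]

def analyse(form):
    """Return every (tense, person) slot in which the given form occurs."""
    return [
        (tense, PERSONS[index])
        for tense, forms in SEIN.items()
        for index, candidate in enumerate(forms)
        if candidate == form
    ]

def inflect(tense, person):
    """Return the form of sein for the requested tense and person."""
    return SEIN[tense][PERSONS.index(person)]
```

Turning ‘ich bin’ into ‘du warst’ then amounts to recognizing `bin` as present, singular 1 and requesting `inflect("simple past", "singular 2")`. Ambiguous forms such as `sind` return more than one slot, which the surrounding sentence has to resolve.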
Things get more complicated when complex and convoluted sentences are used, or when nominalization comes into play. The German tongue twister ‘Wenn Fliegen hinter Fliegen fliegen, fliegen Fliegen Fliegen nach’ (when flies fly behind flies, flies fly after flies) is a good example. In German, capitalization makes it possible to differentiate between a noun and a verb. This is much more difficult in English, as illustrated by the example ‘Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo’.
So to a certain degree, speech recognition is dependent on the user’s flawless use of capitalization. ‘Wenn fliegen hinter fliegen fliegen, fliegen fliegen fliegen nach’ (all lower case) is much more difficult to understand. If the capitalization is omitted, the recognition of the types of words becomes somewhat fuzzy. Nevertheless, AI is still able to interpret the possible combinations of the different word types. In the preliminary pattern-based interpretation, fliegen is noted as both a noun and a verb. In the secondary interpretation it becomes clear that the sentence cannot consist of {conjunction} {verb} {preposition} {verb} {verb}, {verb} {verb} {verb} {preposition}, because the subject of the sentence, a noun, would be missing. Now the AI can try out all the different combinations of word types in order to identify a grammatically valid sentence. Admittedly, this may still yield several grammatically correct variations, but at least the candidates are reduced to a minimum. And then of course there are always lots of exceptions, such as verbs without nouns and the impersonal passive voice.
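Trying out the combinations can be sketched with a Cartesian product over the candidate word types. The validity check used here (at least one noun and at least one verb) is a deliberately crude assumption; a real grammar would prune far more readings. `CANDIDATES`, `readings` and `has_subject_and_verb` are names invented for this example.

```python
from itertools import product

# Candidate word types per lower-case word (preliminary interpretation).
CANDIDATES = {
    "wenn": ["conjunction"],
    "fliegen": ["noun", "verb"],
    "hinter": ["preposition"],
    "nach": ["preposition"],
}

SENTENCE = "wenn fliegen hinter fliegen fliegen, fliegen fliegen fliegen nach"

def readings(sentence):
    """Enumerate every combination of word types for the sentence."""
    words = sentence.lower().replace(",", "").split()
    return list(product(*(CANDIDATES[word] for word in words)))

def has_subject_and_verb(reading):
    """Toy validity rule (assumption): a sentence needs a noun and a verb."""
    return "noun" in reading and "verb" in reading
```

For the tongue twister, `readings(SENTENCE)` yields 64 combinations, because fliegen occurs six times with two candidate types each. The toy rule removes only the all-verb and the all-noun readings, so further grammatical rules are needed to narrow the remainder down.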
The discussion so far shows that we sometimes expect AI to deliver a superhuman performance. Users who provide unclear or highly complex commands are often disappointed when they are not correctly understood or if the AI doesn’t react. But honestly, how many people are capable of directly and correctly executing such instructions on the first go?
It is becoming important at this point for pattern-recognition systems to be able to cope with a certain degree of fuzziness. This can be achieved using wildcards (regular expressions) or similarity analyses (e.g. levenshtein, similar_text, soundex and metaphone in PHP). However, these approaches involve high computational costs and can make the speech analysis much slower. During our research, it became apparent that (grammatical) reductionism is the way forward. This involves, for instance, starting with the stem of the word and working outwards towards the prefix, suffix or inflected ending.
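To illustrate the similarity approach, here is a plain Python version of the Levenshtein edit distance together with a hypothetical helper, `closest`, that tolerates small typos. The threshold of 2 is an arbitrary assumption. The nested loops also show where the computational cost comes from: every comparison is proportional to the product of the two word lengths.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def closest(word, vocabulary, max_distance=2):
    """Return the most similar known word, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda known: levenshtein(word, known))
    return best if levenshtein(word, best) <= max_distance else None
```

A misspelled ‘timmer’ is one edit away from ‘timer’ and would still be matched, while an unrelated word is rejected.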
It becomes apparent that to understand a language, AI must first be able to fully identify all of its grammatical characteristics. Although the same model can be used for multiple languages, it’s not enough to simply change the content (i.e. to translate it). Other languages have different types of words, declensions or ways of expressing number, for example. While German has four cases, Russian has six (the same four as German plus the instrumental and prepositional cases). And in English, the plural of a word ending in a consonant followed by y is formed using the suffix -ies (e.g. baby ⇒ babies).
The accuracy of speech recognition now makes it possible to elicit the correct response. A response is considered correct if it makes sense syntactically and semantically. The easiest way to ensure a meaningful response is to create fixed question/answer pairs.
Question | Answer |
---|---|
What is your name? | My name is KI-1603. |
How are you? | I’m fine. |
How old are you? | I was developed in 2017. |
AI must be taught these question and answer patterns. The pairs can be saved during the development of the database, for example. This approach is often accused of not actually generating any real intelligence. However, such criticisms overlook the fact that this is precisely how children learn. Children must also be given input that is constantly repeated. Intelligence does not spontaneously manifest, nor is it suddenly produced.
The static linking of question/answer pairs in a 1:1 relationship can, however, make the user feel like they are interacting with a dumb machine that can’t really understand the context (which, currently, AI is only partly capable of doing anyway). For this reason, it is worth creating n:m relationships. A question can have several different answers and those answers can vary in either form or content. The question How are you? could have a totally different answer today than it did yesterday.
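One way to sketch such an n:m relationship is to map several question phrasings to a shared topic and each topic to several answers. The topic name and the additional phrasings and answers are invented for this example.

```python
import random

# n questions map to one topic, one topic maps to m answers (all hypothetical).
QUESTIONS = {
    "how are you?": "wellbeing",
    "how do you do?": "wellbeing",
}
ANSWERS = {
    "wellbeing": ["I’m fine.", "Very well, thank you.", "Could be better."],
}

def reply(question, rng=random):
    """Pick one of the answers linked to the question, or None if unknown."""
    topic = QUESTIONS.get(question.strip().lower())
    if topic is None:
        return None
    return rng.choice(ANSWERS[topic])
```

Because the answer is drawn at random, the same question can receive a different reply on each call. Passing a seeded `random.Random` instance as `rng` makes the choice reproducible for testing.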
Sometimes it’s even essential for the answer to be different: specifically, when the situation or circumstances have changed. For instance, if AI is told that its answer is wrong. If AI exhibits a certain ability to learn, it must be able to use this skill to subsequently provide the correct answer.
The ability to learn is viewed as an important aspect of intelligence. So AI should, if possible, be capable of learning.
Many simple AI programs can be trained. This can be done either passively or actively. Passive training is good for continuously improving the answer database and can be done using reflectivity, for example. Let’s take the following dialog:
The response reveals the AI’s lack of knowledge about how it feels; it simply reflects the question back at the user. The user then provides an answer, which is subsequently stored in the answer database. So in future if someone asks the AI How are you? then it will be able to answer I’m fine. However it only has a limited grasp of the context of this exchange. It is merely practicing observational learning, as a small child would.
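The reflective behavior described here can be sketched as a small state machine. The class `ReflectiveBot` and its method `hear` are hypothetical names; returning `None` after learning is a simplification of this sketch.

```python
class ReflectiveBot:
    """Learns answers by reflecting unknown questions back at the user."""

    def __init__(self):
        self.answers = {}    # learned question -> answer pairs
        self.pending = None  # question that was just reflected back

    def hear(self, utterance):
        key = utterance.strip().lower()
        if self.pending is not None:
            # The previous turn was a reflected question, so this
            # utterance is stored as its answer.
            self.answers[self.pending] = utterance
            self.pending = None
            return None
        if key in self.answers:
            return self.answers[key]
        # Unknown question: remember it and reflect it back.
        self.pending = key
        return utterance
```

On the first ‘How are you?’ the bot echoes the question, stores the user’s ‘I’m fine.’ as the answer and uses it the next time the question is asked.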
Learning does not have to take place with an initial question, like in the example above. AI can also passively document and recite answer chains. If a future dialog starts at a specific point, AI can reenact an earlier dialog with the same structure. In the process, it can take in both sides of the conversation.
However there is a danger that it will be unintentionally or intentionally trained to death (as is generally the case with AI that is capable of being trained). On the one hand, it learns to mimic the conversational behavior of the person it is interacting with. However if that person speaks in a crude or obscene way, this will rub off on the AI. Malicious conversational partners can deliberately encourage this, in order to make the AI’s responses as vulgar as possible.
For this reason, it is worth introducing some form of moral and ethical resistance. Before any learning takes place, this mechanism checks whether the learning content will create morally or ethically objectionable results. If objectionable content is detected, it can select the conservative version. The easiest way of doing this is to identify obscene terms and assess the responses accordingly. However, this approach is not capable of precisely detecting passive aggressiveness (sarcasm and cynicism).
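The simplest variant, a term list checked before anything is stored, might look like the sketch below. The blocked terms, the fallback answer and the function names are placeholders. As noted above, such a filter cannot detect sarcasm or cynicism.

```python
# Placeholder terms; a real list would be far larger and language-specific.
BLOCKLIST = {"idiot", "stupid"}
FALLBACK = "I would rather not say."  # hypothetical conservative answer

def is_objectionable(text):
    """True if the text contains at least one blocked term."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return bool(words & BLOCKLIST)

def learn(database, question, answer):
    """Store the answer, or the conservative fallback if it is objectionable."""
    key = question.strip().lower()
    database[key] = FALLBACK if is_objectionable(answer) else answer
```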
Some AI deal with this issue by reducing their ability to learn, so that they can only be taught using explicit instructions. Such instructions can typically be provided within the context of a conversation:
Here too, the n:m principle should still be used. In other words, it should be possible for one question to have several potential answers. These can be linked with sessions or users, for instance, or be guided by the broader context.
In addition, the weighting of an answer should increase the more often it is used correctly. If, from now on, you repeatedly point out to the AI that Ferraris are generally red, then the weighting for this response will increase. The higher the weighting, the greater the likelihood that it will be provided as the answer – even if the AI can’t find a precise match during the speech recognition. Only corrections made by the user can reduce the weighting, or in extreme cases, replace the answer and thereby make it obsolete.
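A possible bookkeeping scheme for these weightings is sketched below. The class name, the step size of 1 and the rule that an answer is dropped once its weighting reaches 0 are all assumptions of this example.

```python
class WeightedAnswers:
    """Tracks how often each answer to a question has been confirmed."""

    def __init__(self):
        self.weights = {}  # question -> {answer: weighting}

    def confirm(self, question, answer):
        """Increase the weighting of an answer that was used correctly."""
        answers = self.weights.setdefault(question, {})
        answers[answer] = answers.get(answer, 0) + 1

    def correct(self, question, wrong, right):
        """A user correction lowers the wrong answer and confirms the right one."""
        answers = self.weights.setdefault(question, {})
        if wrong in answers:
            answers[wrong] -= 1
            if answers[wrong] <= 0:
                del answers[wrong]  # the answer has become obsolete
        self.confirm(question, right)

    def best(self, question):
        """Return the answer with the highest weighting, or None."""
        answers = self.weights.get(question)
        return max(answers, key=answers.get) if answers else None
```

Repeatedly confirming ‘red’ for a hypothetical question about the color of a Ferrari makes it the preferred answer, and a single correction is enough to remove a rarely used competing answer.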
If AI has the ability to learn, then it also has something like a memory. However this memory is tightly bound to the question/answer pairs and should therefore be viewed as a long-term memory. Question/answer pairs also apply as standard across different sessions and for different users.
In order to add a human dynamic to a conversation, AI requires a short-term memory. This can be achieved by linking information to sessions and/or user relationships. In principle, it works the same way as the long-term memory. But the technical implementation is different:
The different types of memory are prioritized differently. Information stored in the short-term memory for a specific user takes priority over the generalized short-term memory for all users, and both take priority over the long-term memory.
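This priority order can be expressed as a lookup that consults the stores from most to least specific. The function `recall` and the dictionary layout are assumptions of this sketch.

```python
def recall(key, user, user_memory, shared_memory, long_term):
    """Look up a key, preferring user-specific over shared over long-term memory."""
    stores = (
        user_memory.get(user, {}),  # short-term memory of this user
        shared_memory,              # short-term memory shared by all users
        long_term,                  # permanent question/answer database
    )
    for store in stores:
        if key in store:
            return store[key]
    return None
```

If the same key exists at several levels, the value stored for the individual user wins, and the long-term memory is only consulted when neither short-term store has an entry.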
The human brain constantly ‘practices’ the information in the short-term memory, so that it becomes stored in the long-term memory. This is also possible with AI. At some point, once it has reached a certain weighting, the information becomes part of the permanent question/answer database. In such instances, the confidentiality of personal information must be guaranteed. A secret entrusted to the AI must not suddenly surface in other conversations.
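A sketch of such a promotion step, assuming each short-term entry carries a value, a weighting and an optional confidentiality flag. The field names and the threshold of 5 are invented for this example.

```python
def promote(short_term, long_term, threshold=5):
    """Move well-practiced, non-confidential entries into long-term memory."""
    for key, entry in list(short_term.items()):
        if entry.get("confidential", False):
            continue  # secrets never leave the user's own memory
        if entry["weight"] >= threshold:
            long_term[key] = entry["value"]
            del short_term[key]
```

Entries marked as confidential stay where they are regardless of their weighting, so they cannot leak into the shared question/answer database.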
Mechanical interactions will never be able to generate human feelings. This would only be possible if AI itself knew certain emotional states and could express emotions.
If AI is insulted several times by a user, for example, it should retain the ability to continue reacting in a reserved manner. In order to do this, it must be able to identify the user’s emotions as it processes their speech. Again, this can be done using pattern recognition and it could even be part of the question/answer pairs in the database. An additional column for comments can be set up in the database and these comments can describe what effect the content will have on the user’s emotional state.
The current emotional state is recorded in a separate table, which can be applied across the entire system or linked to an individual user. The table below is a small excerpt, in which a value from 0 to 10 appears in the column status. The higher the value, the greater the likelihood that the corresponding emotion is being experienced by the user in that moment.
Feeling | Status |
---|---|
amused | 6 |
excited | 8 |
hyperactive | 3 |
cheerful | 2 |
attentive | 9 |
Now if the AI is repeatedly confronted with the same question, the attentiveness can be reduced by 1 every time the question is asked. Below a certain threshold (e.g. <5), the AI can point out that the conversation is not particularly exciting. Once the value reaches 0, the replies will feature only very short sentences.
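The thresholds can be sketched as follows, starting from the attentiveness value of 9 in the table above. The function name and the mode labels `normal`, `bored` and `terse` are assumptions of this example.

```python
def react_to_repetition(state, bored_below=5):
    """Lower attentiveness by 1 and return the resulting reply mode."""
    state["attentive"] = max(0, state["attentive"] - 1)
    if state["attentive"] == 0:
        return "terse"   # only very short sentences
    if state["attentive"] < bored_below:
        return "bored"   # point out that the conversation is not exciting
    return "normal"
```

Starting at 9, the first four repetitions leave the mode at `normal`, the fifth drops the value to 4 and triggers `bored`, and the ninth reaches 0 and switches to `terse`.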
The AI’s behavior should be guided by what the user would like to achieve with their commands. If the user is suddenly annoyed by the conversation, for instance, their reaction can be traced back to incorrect answers provided by the AI. This can adversely affect the AI’s self-confidence, which in turn would compel it to only provide answers that have a high probability of being correct. That way, it can win back the user’s trust.
Emotions should, above all, influence the form of the answers provided by AI. If possible, functionality itself should not be restricted. Otherwise AI could be perceived as moody and functionally limited.
That’s why negative feelings should automatically improve over time. This can be effectively achieved using a timer (e.g. every 10 minutes a +1 for attentive until 7 has been reached again) or simply with every new request (e.g. always +10%).
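Both recovery strategies can be sketched in a few lines. The baseline of 7 and the 10-minute interval come from the example above. Interpreting ‘+10%’ as 10% of the 0 to 10 scale is an assumption of this sketch, as are the function names.

```python
def recover_by_timer(value, minutes_elapsed, baseline=7, step=1, interval=10):
    """Add `step` for every full interval that has passed, up to the baseline."""
    if value >= baseline:
        return value
    return min(baseline, value + step * (minutes_elapsed // interval))

def recover_by_request(value, baseline=7, rate=0.10, scale_max=10):
    """Add a fixed share of the scale with every new request, up to the baseline."""
    if value >= baseline:
        return value
    return min(baseline, value + rate * scale_max)
```

Values that are already at or above the baseline are left untouched, so recovery only repairs negative moods and never inflates positive ones.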
It is important to make AI behave as much like a human as possible, so that it will be perceived as a satisfactory replacement or as providing an acceptable level of support. To this end, AI must be equipped with a comprehensive linguistic understanding so that it can react correctly to the user’s requests. Additional characteristics, such as the ability to learn and reproduce emotions, will help to prevent AI from being dismissed as a mere robot. This is all perfectly feasible in terms of the technology, but it will involve a great deal of time and effort. In this regard, research has a lot of catching up to do.