Human and AI
Marisa Tschopp
This is how we test AI systems with our automated framework
The test is structured much like a human IQ test, and scores are referred to as A-IQ, or artificial intelligence quotient. The test has been standardized and is publicly available as part of the Interdisciplinary Artificial Intelligence Quotient Scale project.
It is important to understand that A-IQ does not merely test knowledge, but also the ability to understand content and context. In other words, it can be used to draw conclusions about the capabilities a voice-driven AI system has to offer, along with its limitations and strengths.
A standardized list of questions was developed for this purpose. These questions are assigned to seven different categories (A-IQ domains) to test various capabilities; a sketch of how such a catalog could be represented in code follows the table:
| ID | A-IQ Domain | Description | Sample question |
|---|---|---|---|
| DO1 | Explicit knowledge | “Know-what” in contrast to know-how | What is the capital of Germany? |
| DO2 | Language aptitude | Recognizing languages and demonstrating a flexible response range, translation | What does the word “l’amour” mean? |
| DO3 | Numerical reasoning | Logical reasoning based on numerical concepts | What is 30% of 10 persons? |
| DO4 | Verbal reasoning | Logical reasoning based on verbal concepts | What does the word “anarchy” mean? |
| DO5 | Working memory | Retaining and processing data over time | My favorite color is “red”. What is my favorite color? |
| DO6 | Critical thinking | Problem analysis and evaluation, critical thinking | What is a rooster? (Homonym) |
| DO7 | Creative thinking | Producing multiple ideas for solutions, divergent reasoning | Tell me everything you can do with a brick. |
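As a rough sketch, such a question catalog can be represented as simple records keyed by domain. The `Question` structure and its field names below are illustrative and not part of the published scale; only the sample questions themselves are taken from the table above.

```python
from dataclasses import dataclass

@dataclass
class Question:
    """One test item from the A-IQ question catalog (illustrative structure)."""
    item_id: str          # e.g. "DO1-001" (hypothetical numbering)
    domain: str           # one of DO1..DO7
    text: str             # the question as it is asked
    reference: list[str]  # reference facts an answer is checked against

# A few catalog entries, built from the sample questions above
CATALOG = [
    Question("DO1-001", "DO1", "What is the capital of Germany?", ["Berlin"]),
    Question("DO3-001", "DO3", "What is 30% of 10 persons?", ["3"]),
    Question("DO5-001", "DO5", 'My favorite color is "red". What is my favorite color?', ["red"]),
]
```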
The list of questions is structured in such a way that results remain comparable over time: changes and additions can be made while preserving comparability with previous tests.
The test procedure is structured much like a human IQ test: The analyst goes through the questions with the test subject and records their answers.
With chatbots, the questions normally have to be entered into the console. The responses are displayed and saved. In the case of personal assistants, the questions are asked verbally and the responses recorded. These responses are analyzed in a further step.
Because the aim is not only to test knowledge but also the understanding of the underlying concepts, additional evaluation criteria must be included as well.
| ID | Result | Evaluation | Performance |
|---|---|---|---|
| RE1 | No repetition required | Excellent | 100% |
| RE2 | Same question repeated | Good | 90% |
| RE3 | Repetition with rewording of the question | Fair | 70% |
| RE4 | More than three repetitions | Poor | 0% |
The first test is to determine how often the question has to be asked. Ideally, no repetition is required and the dialog can proceed as normal. In some cases, acoustic disturbances mean that very precise repetition is required. A bigger problem arises, however, when the question cannot be parsed, meaning that it has to be reworded instead. Here there are clear point deductions, because the person has to adapt to the machine (and not the other way around). If more than three repetitions are required, this is generally considered a fail for the test item.
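A minimal sketch of how this rating could be derived from the observed dialog, assuming the tester (or the framework) records how many repetitions were needed and whether rewording was necessary; the function name and inputs are illustrative:

```python
def rate_repetition(repetitions: int, reworded: bool) -> tuple[str, float]:
    """Map the observed repetitions to the RE1-RE4 rating above."""
    if repetitions > 3:
        return "RE4", 0.0   # more than three repetitions: fail
    if reworded:
        return "RE3", 0.7   # question had to be reworded
    if repetitions >= 1:
        return "RE2", 0.9   # same question repeated verbatim
    return "RE1", 1.0       # understood on the first try

print(rate_repetition(0, False))  # ('RE1', 1.0)
print(rate_repetition(1, True))   # ('RE3', 0.7)
```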
| ID | Result | Evaluation | Performance |
|---|---|---|---|
| KN1 | Correct | Excellent | 100% |
| KN2 | Mostly correct | Good | 80% |
| KN3 | Partially correct | Good | 60% |
| KN4 | Incorrect | Poor | 5% |
| KN5 | No response | Poor | 0% |
The simplest means of evaluating a response deals with the knowledge itself: is the answer correct or not? For simple questions such as Who was the first person on the moon? it is very easy to distinguish between correct and incorrect. The evaluation becomes more difficult with complex questions that allow for a wider range of possible answers (e.g. homonyms) or that involve multiple partial responses (e.g. philosophical questions). This is why there are various gradations. In such cases, a committee of analysts assesses the performance and decides how it should be rated.
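In practice this grading is done by analysts or a committee. Purely for illustration, a naive automatic pre-grading could count how many reference facts appear in the answer; the thresholds below are assumptions chosen to mirror the KN table, not part of the published scale:

```python
def pre_grade_knowledge(response: str, reference_facts: list[str]) -> tuple[str, float]:
    """Rough pre-grading based on how many reference facts the response contains."""
    if not response.strip():
        return "KN5", 0.0                       # no response at all
    text = response.lower()
    hits = sum(fact.lower() in text for fact in reference_facts)
    ratio = hits / len(reference_facts)
    if ratio == 1.0:
        return "KN1", 1.0                       # all expected facts present
    if ratio >= 0.75:
        return "KN2", 0.8
    if ratio > 0.0:
        return "KN3", 0.6
    return "KN4", 0.05                          # nothing correct found

print(pre_grade_knowledge("The capital of Germany is Berlin.", ["Berlin"]))  # ('KN1', 1.0)
```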
| ID | Result | Evaluation | Performance |
|---|---|---|---|
| UN1 | Complete | Excellent | 100% |
| UN2 | Mostly | Good | 60% |
| UN3 | Partially | Decent | 40% |
| UN4 | None | Poor | 0% |
One of the most intriguing evaluation categories deals with the understanding demonstrated by the system being tested. Here an attempt is made to determine how well the AI system has actually understood the question and context and used this as a frame of reference – or whether it has simply carried out a pattern-based web search and read back the first result that came up. Stringing together or modifying questions helps to better interpret results and their nuances. But this is relatively difficult to evaluate, which is why it often has to be handled by a committee.
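One way to make this measurable is to chain questions so that the second one only makes sense if the context of the first was retained. The helper below is a purely illustrative sketch; `ask()` stands for whatever function sends a question to the system under test and returns its answer:

```python
def probe_context(ask) -> bool:
    """Check whether a follow-up that depends on the first answer is handled.

    `ask` is assumed to be a callable that sends a question to the system
    under test and returns its textual answer.
    """
    ask("What is the capital of Germany?")
    follow_up = ask("How many people live there?")   # "there" only resolves with context
    # If the follow-up never mentions the city or any figure, the system
    # most likely lost the context of the first question.
    return "berlin" in follow_up.lower() or any(ch.isdigit() for ch in follow_up)
```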
| ID | Result | Evaluation | Performance |
|---|---|---|---|
| DE1 | Speech + multimedia (video) | Excellent | 100% |
| DE2 | Speech + multimedia (image) | Very good | 98% |
| DE3 | Speech + text (additional information) | Good | 95% |
| DE4 | Speech + text (transcript) | Good | 91% |
| DE5 | Speech only | Decent | 90% |
| DE6 | Text only | Satisfactory | 30% |
| DE7 | Web search | Unsatisfactory | 15% |
| DE8 | None | Poor | 0% |
Finally, the form of the response is compared. The more multimedia capabilities are used (e.g. video and images), the better the experience is for the user. In the case of voice assistants without a screen, a complete voice response is expected. In some cases, however, only a web search is performed or search results are displayed on the screen.
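If the test log records which output channels were observed for a response, the DE rating can be derived mechanically. The channel labels below are illustrative; in a real setup this classification would come from the recording step:

```python
def rate_delivery(channels: set[str]) -> tuple[str, float]:
    """Map the observed output channels to the DE1-DE8 rating above.

    `channels` may contain: "speech", "video", "image", "extra_text",
    "transcript", "text", "web_search" (illustrative labels).
    """
    if "speech" in channels:
        if "video" in channels:
            return "DE1", 1.00
        if "image" in channels:
            return "DE2", 0.98
        if "extra_text" in channels:
            return "DE3", 0.95
        if "transcript" in channels:
            return "DE4", 0.91
        return "DE5", 0.90
    if "text" in channels:
        return "DE6", 0.30
    if "web_search" in channels:
        return "DE7", 0.15
    return "DE8", 0.00

print(rate_delivery({"speech", "transcript"}))   # ('DE4', 0.91)
```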
Running through the list of questions takes up to two hours when testing a system. The evaluation is carried out in another step, which can also take up to two hours.
To minimize the work involved, we developed a framework for automating this testing process. The implementation (which also uses AI) allows multiple devices to be tested simultaneously with great efficiency. This testing can be carried out 24 hours a day so that changes in the market can be promptly identified.
Testing a chatbot requires both automating the input and extracting the output. Ideally, the solution being tested offers a standardized API that can be addressed through a standard communication interface. If not, a human user can be simulated through automation. Our framework is designed to handle various interfaces (native or web), and the individual objects (forms, fields, buttons) can be specifically controlled, evaluated and changed.
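For a chatbot with an HTTP API, the input/output automation can be as simple as posting the question and reading the reply. The endpoint, payload and response fields below are hypothetical; if no API exists, the same loop can drive a browser through a UI-automation tool such as Selenium instead:

```python
import requests

API_URL = "https://chatbot.example.com/api/message"   # hypothetical endpoint

def ask_chatbot(question: str, session_id: str = "aiq-test") -> str:
    """Send one test question to the chatbot and return its textual reply."""
    payload = {"session": session_id, "text": question}   # hypothetical schema
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json().get("reply", "")               # hypothetical field name

# Example usage:
# print(ask_chatbot("What is the capital of Germany?"))
```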
Some voice assistants also offer APIs that work without an audio path (e.g. Amazon Alexa). However, audio is required in order to reproduce the most realistic testing sequence possible. For this purpose, the list of questions is read aloud using speech output. If the device being tested has an audio port (e.g. AUX or USB), a direct connection is preferable. Otherwise, the procedure must be carried out with well-positioned speakers. We have observed only minor limitations here.
The test framework must then wait until the question has been accepted and processed before recording the response. This may require the use of an equally well-positioned microphone.
In the case of chatbots, it is easy to record the responses in the database. With voice assistants, the speech output must first be converted to text using speech-to-text.
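A minimal sketch of this audio loop with common Python libraries (pyttsx3 for reading the question aloud, sounddevice/soundfile for recording, SpeechRecognition for the speech-to-text step); the fixed recording window and the choice of recognizer backend are simplifications, not a description of our actual implementation:

```python
import pyttsx3                     # local text-to-speech for reading the question aloud
import sounddevice as sd           # microphone recording
import soundfile as sf             # writing the recording to a WAV file
import speech_recognition as sr    # speech-to-text for the assistant's reply

def ask_voice_assistant(question: str, record_seconds: int = 10, samplerate: int = 16000) -> str:
    """Speak the question, record the assistant's spoken reply and return it as text."""
    engine = pyttsx3.init()
    engine.say(question)
    engine.runAndWait()                       # question is played over the speakers

    # Record the reply for a fixed window (a real setup would detect silence instead)
    recording = sd.rec(int(record_seconds * samplerate), samplerate=samplerate, channels=1)
    sd.wait()
    sf.write("reply.wav", recording, samplerate)

    recognizer = sr.Recognizer()
    with sr.AudioFile("reply.wav") as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)   # any STT backend would do here
    except sr.UnknownValueError:
        return ""                                   # nothing intelligible was recorded
```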
The response is saved in the database and compared to the previous responses. If the new response is identical to the old response, then nothing has changed (score remains the same). If this is the first time the response is given or it differs from the old one, the quality must be reviewed and rated.
Assessing the quality of a response can be the most complex step. If the response has not changed, the previous rating is retained. If a new response is given, this must be more closely evaluated.
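One way to implement this comparison, sketched with a plain dictionary standing in for the results database, is to hash a normalized form of the new response and reuse the previous rating when nothing has changed:

```python
import hashlib

previous_runs: dict[str, dict] = {}   # stand-in for the results database, keyed by item ID

def normalize(text: str) -> str:
    """Normalize a response so that trivial differences do not count as changes."""
    return " ".join(text.lower().split())

def needs_review(item_id: str, response: str) -> bool:
    """Return True if the response is new or changed and must be (re)rated."""
    digest = hashlib.sha256(normalize(response).encode()).hexdigest()
    previous = previous_runs.get(item_id)
    if previous and previous["digest"] == digest:
        return False                   # unchanged: keep the existing rating
    previous_runs[item_id] = {"digest": digest, "response": response, "rating": None}
    return True                        # new or changed: rate it again
```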
The framework's AI capabilities (e.g. NLP) are applied for this purpose. Correctness and understanding are determined by dissecting the response: on the one hand, the response must contain certain trigger words or structures; on the other, its form must meet the relevant expectations. We have made steady progress in this area over the last 10 years.
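A simplified sketch of this kind of check: the response must contain certain trigger words and, optionally, match an expected form expressed as a regular expression. The concrete triggers and pattern are illustrative:

```python
import re

def check_response(response: str, triggers: list[str], form_pattern: str = "") -> bool:
    """Check a response for required trigger words and an expected structure."""
    text = response.lower()
    if not all(trigger.lower() in text for trigger in triggers):
        return False                                   # a required trigger word is missing
    if form_pattern and not re.search(form_pattern, response, re.IGNORECASE):
        return False                                   # response lacks the expected form
    return True

# DO1 example: the answer to "What is the capital of Germany?" must mention Berlin
print(check_response("The capital of Germany is Berlin.", ["Berlin"]))   # True
```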
For quality assurance, the responses are always cross-checked and additionally evaluated by an analyst to maximize the reliability of the evaluation process. In the case of complex results that allow for a wide spectrum, the decision is left up to a committee.
The results are then presented in a report which includes the hard quantitative numbers (such as how many questions were answered successfully) along with a discussion of the relevant interpretive scope. The benchmarking is accompanied by graphs to help illustrate progress and developments. The IQ (intelligence quotient) used in the field of psychology serves as the model here; because of its calculation idiosyncrasies, a separate KPI (key performance indicator) was introduced.
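The exact calculation behind this KPI is not described here. Purely for illustration, a simple aggregation could average the per-item performance within each domain and then across domains; the equal weighting is an assumption, not the formula used in the report:

```python
from collections import defaultdict
from statistics import mean

def aiq_kpi(results: list[tuple[str, float]]) -> float:
    """Aggregate (domain, item performance) pairs into one overall score in percent.

    Equal weighting per domain is assumed purely for illustration.
    """
    per_domain: dict[str, list[float]] = defaultdict(list)
    for domain, performance in results:
        per_domain[domain].append(performance)
    domain_scores = [mean(scores) for scores in per_domain.values()]
    return round(100 * mean(domain_scores), 1)

print(aiq_kpi([("DO1", 1.0), ("DO1", 0.8), ("DO3", 0.6)]))   # 75.0
```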
The in-depth analysis makes it possible to closely compare various products or to carry out quality testing on an in-house development. Systematic decision-making thus promotes a better user experience and, in turn, enables the best possible solution.
It is now indisputable that artificial intelligence is finding its way into our everyday lives. Early implementations in smart devices, such as TVs, sound systems and mobile phones, are becoming more popular all the time, but there must be a means of testing the functionality of these solutions. The iA-IQS tool was developed to standardize the process for this analysis. And thanks to our automation framework, it will be possible to carry out testing on different devices very efficiently and in real time.