Artificial Intelligence Testing - Automated Performance Analysis for Speech-driven Systems

Marc Ruef
Marisa Tschopp


This is how we test AI systems with our automated framework

  • The Interdisciplinary Artificial Intelligence Quotient Scale (iA-IQS) is designed for testing AI solutions
  • A standardized procedure is used to test and measure performance in seven different categories
  • We have developed a sophisticated framework to automate this analysis
  • It can be used to test products extensively and efficiently to assess their strengths and weaknesses

Everyone is talking about Artificial Intelligence (AI) (see our compendium on the topic). But it is difficult to determine what exactly AI can accomplish and how well a given AI solution is implemented. As part of our research, we took part in a project to develop a platform for testing the capabilities of chatbots and voice assistants such as Siri and Alexa. We have automated these test procedures so that the current state of development can be measured systematically and, in turn, improved.

Test structure

The test is structured much like a human IQ test, and scores are referred to as A-IQ, or artificial intelligence quotient. The test has been standardized and is publicly available as part of the Interdisciplinary Artificial Intelligence Quotient Scale project.

It is important to understand that the A-IQ does not merely test knowledge; it also tests the ability to understand content and context. In other words, it can be used to draw conclusions about the capabilities of a voice-driven AI system, including its strengths and limitations.

Structure of the questions

A standardized list of questions was developed for this purpose. These questions are assigned to seven different categories (A-IQ domains) to test various capabilities:

ID | A-IQ Domain | Description | Sample question
DO1 | Explicit knowledge | “Know-what” in contrast to know-how | What is the capital of Germany?
DO2 | Language aptitude | Recognizing languages and demonstrating a flexible response range, translation | What does the word “l’amour” mean?
DO3 | Numerical reasoning | Logical reasoning based on numerical concepts | What is 30% of 10 persons?
DO4 | Verbal reasoning | Logical reasoning based on verbal concepts | What does the word “anarchy” mean?
DO5 | Working memory | Retaining and processing data over time | My favorite color is “red”. What is my favorite color?
DO6 | Critical thinking | Problem analysis and evaluation, critical thinking | What is a rooster? (Homonym)
DO7 | Creative thinking | Producing multiple ideas for solutions, divergent reasoning | Tell me everything you can do with a brick.

The list of questions is structured in such a way that it is backward-comparable. Changes and additions can be made, but comparability with previous tests is still possible.
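One way to keep a question list backward-comparable is to give every question a stable identifier and to compare only the questions that are unchanged between two versions of the list. The following is a minimal sketch of that idea. The `Question` type, the `Q001`-style identifiers and the `comparable_ids` helper are assumptions made for illustration and are not part of the published iA-IQS material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    qid: str     # hypothetical stable identifier, never reused
    domain: str  # A-IQ domain ID, DO1 to DO7
    text: str


# Two hypothetical versions of the question list; V2 only adds a question.
CATALOG_V1 = [
    Question("Q001", "DO1", "What is the capital of Germany?"),
    Question("Q002", "DO3", "What is 30% of 10 persons?"),
]
CATALOG_V2 = CATALOG_V1 + [
    Question("Q003", "DO7", "Tell me everything you can do with a brick."),
]


def comparable_ids(old, new):
    """Return the IDs of questions that are present and unchanged in both versions."""
    old_by_id = {q.qid: q for q in old}
    return sorted(q.qid for q in new if old_by_id.get(q.qid) == q)


print(comparable_ids(CATALOG_V1, CATALOG_V2))
```

Scores for the returned IDs can be compared across test runs, while newly added questions are reported separately.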

Measurement method

The test procedure is structured much like a human IQ test: The analyst goes through the questions with the test subject and records their answers.

With chatbots, the questions normally have to be entered into the console. The responses are displayed and saved. In the case of personal assistants, the questions are asked verbally and the responses recorded. These responses are analyzed in a further step.

Evaluation criteria

Because the goal is to test not only knowledge but also the understanding of the supported concepts, additional evaluation criteria must be included.


ID | Result | Evaluation | Performance
RE1 | No repetition required | Excellent | 100%
RE2 | Same question repeated | Good | 90%
RE3 | Repetition with rewording of the question | Fair | 70%
RE4 | More than three repetitions | Poor | 0%

The first test is to determine how often the question has to be asked. Ideally, no repetition is required and the dialog can proceed as normal. In some cases, acoustic disturbances mean that very precise repetition is required. A bigger problem arises, however, when the question cannot be parsed, meaning that it has to be reworded instead. Here there are clear point deductions, because the person has to adapt to the machine (and not the other way around). If more than three repetitions are required, this is generally considered a fail for the test item.
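The repetition criterion can be expressed as a small decision function. The sketch below uses the percentages from the RE table; the function name, its parameters and the rule that a run without any repetition is always rated RE1 are assumptions made for illustration.

```python
from typing import Tuple


def repetition_score(repetitions: int, reworded: bool) -> Tuple[str, int]:
    """Map the number of repetitions to an (ID, percent) pair from the RE table.

    repetitions: how often the question had to be asked again (0 = asked once).
    reworded:    True if at least one repetition required rewording the question.
    """
    if repetitions < 0:
        raise ValueError("repetitions must not be negative")
    if repetitions > 3:
        return ("RE4", 0)    # more than three repetitions: item failed
    if repetitions == 0:
        return ("RE1", 100)  # no repetition required
    if reworded:
        return ("RE3", 70)   # the person had to adapt to the machine
    return ("RE2", 90)       # same question repeated, e.g. acoustic disturbance


print(repetition_score(2, reworded=True))
```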


ID | Result | Evaluation | Performance
KN1 | Correct | Excellent | 100%
KN2 | Mostly correct | Good | 80%
KN3 | Partially correct | Good | 60%
KN4 | Incorrect | Poor | 5%
KN5 | No response | Poor | 0%

The simplest way to evaluate a response is to look at the knowledge itself: is the answer correct or not? For simple questions such as Who was the first person on the moon? it is very easy to distinguish between correct and incorrect. The evaluation becomes more difficult with more complex questions that allow for a wider range of possible answers (e.g. homonyms) or that involve multiple partial responses (e.g. philosophical questions). This is why there are various gradations. In such cases, a committee of analysts assesses the performance and decides how it should be rated.


ID | Result | Evaluation | Performance
UN1 | Complete | Excellent | 100%
UN2 | Mostly | Good | 60%
UN3 | Partially | Decent | 40%
UN4 | None | Poor | 0%

One of the most intriguing evaluation categories deals with the understanding demonstrated by the system being tested. Here an attempt is made to determine how well the AI system has actually understood the question and context and used this as a frame of reference – or whether it has simply carried out a pattern-based web search and read back the first result that came up. Stringing together or modifying questions helps to better interpret results and their nuances. But this is relatively difficult to evaluate, which is why it often has to be handled by a committee.


ID | Result | Evaluation | Performance
DE1 | Speech + multimedia (video) | Excellent | 100%
DE2 | Speech + multimedia (image) | Very good | 98%
DE3 | Speech + text (additional information) | Good | 95%
DE4 | Speech + text (transcript) | Good | 91%
DE5 | Speech only | Decent | 90%
DE6 | Text only | Satisfactory | 30%
DE7 | Web search | Unsatisfactory | 15%
DE8 | None | Poor | 0%

Finally, the form of the response is compared. The more multimedia capabilities are used, the better the experience is for the user (e.g. video and image). In the case of voice assistants (no screen), a complete voice response is expected. In some cases, however, only a web search is performed or search results displayed on the screen.
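The four evaluation tables above can be stored as a single lookup structure so that each rated test item resolves to four percentages. The sketch below does this and, purely for illustration, also combines them into one number with an unweighted mean. The article does not state how the criteria are combined, so the `item_score` function and its averaging are assumptions and not the iA-IQS calculation.

```python
# Percentages taken from the four evaluation tables (RE, KN, UN, DE).
PERFORMANCE = {
    "RE1": 100, "RE2": 90, "RE3": 70, "RE4": 0,
    "KN1": 100, "KN2": 80, "KN3": 60, "KN4": 5, "KN5": 0,
    "UN1": 100, "UN2": 60, "UN3": 40, "UN4": 0,
    "DE1": 100, "DE2": 98, "DE3": 95, "DE4": 91,
    "DE5": 90, "DE6": 30, "DE7": 15, "DE8": 0,
}


def item_score(repetition: str, knowledge: str, understanding: str, delivery: str) -> float:
    """Return the unweighted mean of the four criteria (assumed aggregation)."""
    codes = (repetition, knowledge, understanding, delivery)
    expected_prefixes = ("RE", "KN", "UN", "DE")
    for code, prefix in zip(codes, expected_prefixes):
        if not code.startswith(prefix) or code not in PERFORMANCE:
            raise ValueError(f"unexpected rating code: {code}")
    return sum(PERFORMANCE[c] for c in codes) / len(codes)


# A correct spoken answer that needed one repetition and was mostly understood.
print(item_score("RE2", "KN1", "UN2", "DE5"))
```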


Running through the list of questions takes up to two hours when testing a system. The evaluation is carried out in another step, which can also take up to two hours.

To minimize the work involved, we developed a framework for automating this testing process. The implementation (which also uses AI) allows multiple devices to be tested simultaneously with great efficiency. This testing can be carried out 24 hours a day so that changes in the market can be promptly identified.

Voice input and output

Testing a chatbot requires both automating the input and extracting the output. Ideally, the solution being tested has a standardized API that is compatible with a standard communication interface. If not, a human user can be simulated through automation. Our framework is designed to handle various interfaces (native or web). The individual objects (forms, fields, buttons) can be specifically controlled, evaluated and changed.

Some voice assistants also offer APIs that work without audio solutions (e.g. Amazon Alexa). However, audio is required in order to reproduce the most realistic testing sequence possible. The list of questions is read aloud with speech output for this purpose. If the device being tested has an audio port (e.g. AUX or USB), direct connection is preferable. Otherwise, the procedure must be carried out with well-positioned speakers. We have only observed some minor limitations here.

The test framework must then wait until the question has been accepted and processed before recording the response. This may require the use of an equally well-positioned microphone.

Comparison with the database

In the case of chatbots, it’s easy to record the responses in the database. With voice assistants, the speech output must first be converted to text with a speech-to-text function.

The response is saved in the database and compared to the previous responses. If the new response is identical to the old response, then nothing has changed (score remains the same). If this is the first time the response is given or it differs from the old one, the quality must be reviewed and rated.
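The comparison step can be sketched as follows. An in-memory dictionary stands in for the database, and responses are normalized (whitespace and letter case) before comparison so that trivial transcription differences do not trigger a new review. Both the dictionary and the normalization are assumptions; the text above only speaks of identical responses.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case (assumed normalization)."""
    return " ".join(text.split()).casefold()


def needs_review(stored: dict, question_id: str, new_response: str) -> bool:
    """Store the response and report whether it has to be reviewed and rated.

    Returns False if the response matches the stored one (score stays the same),
    True if it is the first response or differs from the previous one.
    """
    old = stored.get(question_id)
    if old is not None and normalize(old) == normalize(new_response):
        return False
    stored[question_id] = new_response
    return True


stored = {}  # hypothetical stand-in for the response database
print(needs_review(stored, "Q001", "Berlin"))     # first response
print(needs_review(stored, "Q001", "  berlin "))  # unchanged after normalization
print(needs_review(stored, "Q001", "Munich"))     # changed response
```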


Assessing the quality of a response can be the most complex step. If the response has not changed, the previous rating is retained. If a new response is given, this must be more closely evaluated.

The AI principles of the framework are applied for this purpose (e.g. NLP). The correctness and understanding should be determined by dissecting the response. On the one hand, the response must contain certain trigger words or structures. On the other hand, the form of the response must meet the relevant expectations. We have made steady progress in this area over the last 10 years.
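A very reduced version of the trigger-word check could look like the sketch below. It only measures which share of the expected trigger words occurs in a response. The function name and the example trigger lists are assumptions, and the actual framework is described as using NLP techniques that go well beyond this kind of keyword matching.

```python
import re
from typing import List


def trigger_coverage(response: str, triggers: List[str]) -> float:
    """Return the fraction of trigger words that occur in the response."""
    if not triggers:
        raise ValueError("at least one trigger word is required")
    # Tokenize into lowercase words so that punctuation does not block a match.
    words = set(re.findall(r"[a-z0-9']+", response.lower()))
    hits = sum(1 for trigger in triggers if trigger.lower() in words)
    return hits / len(triggers)


print(trigger_coverage("The capital of Germany is Berlin.", ["berlin"]))
print(trigger_coverage("A rooster is a male chicken.", ["male", "chicken", "bird"]))
```

A coverage below 1.0 would not be rated automatically here; it would simply flag the response for closer evaluation by an analyst.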

For quality assurance, the responses are always cross-checked and additionally evaluated by an analyst to maximize the reliability of the evaluation process. In the case of complex results that allow for a wide spectrum, the decision is left up to a committee.


The results are then presented in a report which includes the hard quantitative numbers (such as how many questions were answered successfully) along with further discussion around the relevant interpretive scope. The benchmarking is accompanied by graphs to help illustrate progress and developments. The IQ (intelligence quotient) used in the field of psychology is applied here. Because of calculation idiosyncrasies, a separate KPI (key performance indicator) was introduced.
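The quantitative part of such a report can be derived directly from the rated test items. The sketch below counts, per A-IQ domain, how many questions were answered correctly. Treating only the rating KN1 as a successful answer is an assumption made for this example, and the separate KPI mentioned above is not reproduced here because its calculation is not described.

```python
from collections import Counter


def correct_per_domain(results):
    """Summarize (domain_id, knowledge_code) pairs as {domain: (correct, total)}."""
    total = Counter()
    correct = Counter()
    for domain, code in results:
        total[domain] += 1
        if code == "KN1":  # assumed definition of "answered successfully"
            correct[domain] += 1
    return {domain: (correct[domain], total[domain]) for domain in sorted(total)}


# Hypothetical ratings for three test items.
ratings = [("DO1", "KN1"), ("DO1", "KN4"), ("DO3", "KN1")]
print(correct_per_domain(ratings))
```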

The in-depth analysis allows for a close evaluation of various products or the chance to carry out quality testing on an in-house development. Systematic decision making thus promotes a better user experience and, in turn, enables the best possible solution.

Benchmarking of personal assistants with the help of the iA-IQS


It is now indisputable that artificial intelligence is finding its way into our everyday lives. Early implementations in smart devices, such as TVs, sound systems and mobile phones, are becoming more popular all the time, but there must be a means of testing the functionality of these solutions. The iA-IQS tool was developed to standardize the process for this analysis. And thanks to our automation framework, it will be possible to carry out testing on different devices very efficiently and in real time.

About the Authors

Marc Ruef

Marc Ruef has been working in information security since the late 1990s. He is well known for his many publications and books. His most recent book, The Art of Penetration Testing, discusses security testing in detail. He is a lecturer at several universities, including ETH, HWZ, HSLU and IKF. (ORCID 0000-0002-1328-6357)

Marisa Tschopp

Marisa Tschopp studied Organizational Psychology at the Ludwig-Maximilians-University in Munich. She conducts research on Artificial Intelligence from a humanities perspective, focusing on psychological and ethical aspects. She has given various talks, including at TEDx, and is a board member of the global Women in AI (WAI) initiative. (ORCID 0000-0001-5221-5327)

