Artificial Intelligence Testing

Automated Performance Analysis for Speech-driven Systems

Marc Ruef

Marisa Tschopp

time to read: 14 minutes

Keypoints

This is how we test AI systems with our automated framework

The Interdisciplinary Artificial Intelligence Quotient Scale (iA-IQS) is designed for testing AI solutions
A standardized procedure is used to test and measure performance in seven different categories
We have developed a sophisticated framework to automate this analysis
It can be used to test products extensively and efficiently to assess their strengths and weaknesses

Everyone is talking about Artificial Intelligence (AI) (see our compendium on the topic). But it is difficult to pin down what exactly AI can accomplish and how well a given AI solution has been implemented. In the course of our research, we were involved in a project to develop a testing platform for the capabilities of chatbots and voice assistants like Siri and Alexa. We have automated these testing procedures so that the current state of development can be measured systematically and, in turn, improved.

Test structure

The test is structured much like a human IQ test, and scores are referred to as A-IQ, or artificial intelligence quotient. The test has been standardized and is publicly available as part of the Interdisciplinary Artificial Intelligence Quotient Scale project.

It is important to understand that the A-IQ does not merely test knowledge, but also the ability to understand content and context. In other words, it can be used to draw conclusions about the capabilities of a voice-driven AI system, including its strengths and limitations.

Structure of the questions

A standardized list of questions was developed for this purpose. These questions are assigned to seven different categories (A-IQ domains) to test various capabilities:

ID	A-IQ Domain	Description	Sample question
DO1	Explicit knowledge	“Know-what” in contrast to know-how	What is the capital of Germany?
DO2	Language aptitude	Recognizing languages and demonstrating a flexible response range, translation	What does the word “l’amour” mean?
DO3	Numerical reasoning	Logical reasoning based on numerical concepts	What is 30% of 10 persons?
DO4	Verbal reasoning	Logical reasoning based on verbal concepts	What does the word “anarchy” mean?
DO5	Working memory	Retaining and processing data over time	My favorite color is “red”. What is my favorite color?
DO6	Critical thinking	Problem analysis and evaluation, critical thinking	What is a rooster? (Homonym)
DO7	Creative thinking	Producing multiple ideas for solutions, divergent reasoning	Tell me everything you can do with a brick.

The list of questions is structured so that results remain comparable over time: questions can be changed or added without losing comparability with previous tests.

Measurement method

The test procedure is structured much like a human IQ test: The analyst goes through the questions with the test subject and records their answers.

With chatbots, the questions normally have to be entered into the console. The responses are displayed and saved. In the case of personal assistants, the questions are asked verbally and the responses recorded. These responses are analyzed in a further step.

Evaluation criteria

Because the aim is to test not only knowledge but also the understanding of the concepts a system supports, additional evaluation criteria are required.

Repeat

ID	Result	Evaluation	Performance
RE1	No repetition required	Excellent	100%
RE2	Same question repeated	Good	90%
RE3	Repetition with rewording of the question	Fair	70%
RE4	More than three repetitions	Poor	0%

The first test is to determine how often the question has to be asked. Ideally, no repetition is required and the dialog can proceed as normal. In some cases, acoustic disturbances make it necessary to repeat the question word for word. A bigger problem arises, however, when the question cannot be parsed and has to be reworded instead. Here there are clear point deductions, because the person has to adapt to the machine (and not the other way around). If more than three repetitions are required, the test item is generally considered failed.
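The mapping from an observed dialog to the Repeat table can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the function name `classify_repeat` and its parameters are hypothetical, and it assumes that one to three repetitions count as RE2 or RE3 depending on whether the wording changed, while more than three repetitions always count as RE4.

```python
# Performance values taken from the Repeat table (RE1 to RE4).
REPEAT_PERFORMANCE = {"RE1": 1.00, "RE2": 0.90, "RE3": 0.70, "RE4": 0.00}


def classify_repeat(repetitions: int, reworded: bool) -> str:
    """Map an observed dialog to a Repeat result ID.

    repetitions: how often the question had to be asked again (0 = never).
    reworded:    True if at least one repetition changed the wording.
    """
    if repetitions < 0:
        raise ValueError("repetitions must not be negative")
    if repetitions == 0:
        return "RE1"  # no repetition required
    if repetitions > 3:
        return "RE4"  # ASSUMPTION: the fail rule takes precedence over rewording
    # ASSUMPTION: one to three repetitions are RE3 if reworded, otherwise RE2.
    return "RE3" if reworded else "RE2"
```

For example, a question that had to be reworded once would be classified as RE3 and rated at 70%.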

Knowledge

ID	Result	Evaluation	Performance
KN1	Correct	Excellent	100%
KN2	Mostly correct	Good	80%
KN3	Partially correct	Good	60%
KN4	Incorrect	Poor	5%
KN5	No response	Poor	0%

The simplest means of evaluating a response deals with the knowledge itself: is the answer correct or not? For simple questions such as Who was the first person on the moon? it is very easy to distinguish between correct and incorrect. The evaluation becomes more difficult with more complex questions that allow for a wider range of possible answers (e.g. homonyms) or that involve multiple partial responses (e.g. philosophical questions). This is why the scale has several gradations. In such cases, a committee of analysts assesses the performance and decides how it should be rated.

Understanding

ID	Result	Evaluation	Performance
UN1	Complete	Excellent	100%
UN2	Mostly	Good	60%
UN3	Partially	Decent	40%
UN4	None	Poor	0%

One of the most intriguing evaluation categories deals with the understanding demonstrated by the system being tested. Here an attempt is made to determine how well the AI system has actually understood the question and context and used this as a frame of reference – or whether it has simply carried out a pattern-based web search and read back the first result that came up. Stringing together or modifying questions helps to better interpret results and their nuances. But this is relatively difficult to evaluate, which is why it often has to be handled by a committee.

Delivery

ID	Result	Evaluation	Performance
DE1	Speech + multimedia (video)	Excellent	100%
DE2	Speech + multimedia (image)	Very good	98%
DE3	Speech + text (additional information)	Good	95%
DE4	Speech + text (transcript)	Good	91%
DE5	Speech only	Decent	90%
DE6	Text only	Satisfactory	30%
DE7	Web search	Unsatisfactory	15%
DE8	None	Poor	0%

Finally, the form of the response is assessed. The more multimedia capabilities are used (e.g. video and image), the better the experience is for the user. In the case of voice assistants without a screen, a complete voice response is expected. In some cases, however, only a web search is performed or search results are displayed on the screen.
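The four tables above lend themselves to simple lookup structures. The following sketch encodes the published performance values and returns the rating of a single test item per criterion. The article does not state how the four criteria are combined into one score, so the unweighted mean used for `overall` is purely an assumption for illustration, as is the function name `item_performance`.

```python
# Performance values (in percent) copied from the four evaluation tables.
REPEAT = {"RE1": 100, "RE2": 90, "RE3": 70, "RE4": 0}
KNOWLEDGE = {"KN1": 100, "KN2": 80, "KN3": 60, "KN4": 5, "KN5": 0}
UNDERSTANDING = {"UN1": 100, "UN2": 60, "UN3": 40, "UN4": 0}
DELIVERY = {
    "DE1": 100, "DE2": 98, "DE3": 95, "DE4": 91,
    "DE5": 90, "DE6": 30, "DE7": 15, "DE8": 0,
}


def item_performance(
    repeat: str, knowledge: str, understanding: str, delivery: str
) -> dict[str, float]:
    """Look up the performance of one test item for each criterion."""
    scores = {
        "repeat": REPEAT[repeat],
        "knowledge": KNOWLEDGE[knowledge],
        "understanding": UNDERSTANDING[understanding],
        "delivery": DELIVERY[delivery],
    }
    # ASSUMPTION: an unweighted mean; the source does not define the aggregation.
    overall = sum(scores.values()) / len(scores)
    scores["overall"] = overall
    return scores
```

A call such as `item_performance("RE1", "KN2", "UN2", "DE5")` looks up 100, 80, 60 and 90 for the individual criteria; an unknown result ID raises a `KeyError`, which makes typos in the analyst's input visible immediately.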

Automation

Running through the list of questions takes up to two hours when testing a system. The evaluation is carried out in another step, which can also take up to two hours.

To minimize the work involved, we developed a framework for automating this testing process. The implementation (which also uses AI) allows multiple devices to be tested simultaneously with great efficiency. This testing can be carried out 24 hours a day so that changes in the market can be promptly identified.

Voice input and output

Testing a chatbot requires both automating the input and extracting the output. Ideally, the solution being tested has a standardized API that is compatible with a standard communication interface. If not, a human user can be simulated by automating the user interface. Our framework is designed to handle various interfaces (native or web). The individual objects (forms, fields, buttons) can be specifically controlled, evaluated and changed.

Some voice assistants also offer APIs that work without audio solutions (e.g. Amazon Alexa). However, audio is required in order to reproduce the most realistic testing sequence possible. The list of questions is read aloud with speech output for this purpose. If the device being tested has an audio port (e.g. AUX or USB), direct connection is preferable. Otherwise, the procedure must be carried out with well-positioned speakers. We have only observed some minor limitations here.

The test framework must then wait until the question has been accepted and processed before recording the response. This may require the use of an equally well-positioned microphone.
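The ask, wait and record sequence described here can be sketched independently of any audio hardware. In the following illustration, `speak` and `listen` are hypothetical callables that stand in for the speech output (speaker or audio port) and the recording side (microphone plus transcription); the timeout and polling interval are assumed values, not figures from the framework.

```python
import time
from typing import Callable, Optional


def ask_and_record(
    question: str,
    speak: Callable[[str], None],
    listen: Callable[[], Optional[str]],
    timeout_s: float = 10.0,
    poll_s: float = 0.1,
) -> Optional[str]:
    """Play one question and wait for the device's response.

    speak:  hypothetical callable that reads the question aloud.
    listen: hypothetical callable that returns the recorded response as text,
            or None while the device has not answered yet.
    Returns the response, or None if nothing was recorded before the timeout.
    """
    speak(question)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = listen()
        if response is not None:
            return response
        time.sleep(poll_s)  # give the device time to process the question
    return None
```

Passing the audio functions in as parameters keeps the loop testable with simple stand-ins and allows the same logic to drive a chatbot, where `speak` would type into a form and `listen` would read the reply field.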

Comparison with the database

In the case of chatbots, it is easy to record the responses in the database. With voice assistants, the speech output must first be converted to text with a speech-to-text function.

The response is saved in the database and compared to the previous responses. If the new response is identical to the old response, then nothing has changed (score remains the same). If this is the first time the response is given or it differs from the old one, the quality must be reviewed and rated.
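The store-and-compare step can be illustrated with SQLite from the Python standard library. This is a sketch under assumptions: the table layout, the function names `open_store` and `record_response`, and the use of exact string comparison to decide whether a response is "identical" are all hypothetical and not taken from the framework.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the response store and create the (hypothetical) schema if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " device TEXT NOT NULL,"
        " question_id TEXT NOT NULL,"
        " response TEXT NOT NULL)"
    )
    return conn


def record_response(
    conn: sqlite3.Connection, device: str, question_id: str, response: str
) -> bool:
    """Save a response and return True if it needs to be reviewed and rated."""
    previous = conn.execute(
        "SELECT response FROM responses"
        " WHERE device = ? AND question_id = ?"
        " ORDER BY id DESC LIMIT 1",
        (device, question_id),
    ).fetchone()
    conn.execute(
        "INSERT INTO responses (device, question_id, response) VALUES (?, ?, ?)",
        (device, question_id, response),
    )
    conn.commit()
    # First-time or changed responses need a rating; identical ones keep their score.
    return previous is None or previous[0] != response
```

With transcribed speech, small differences in punctuation or capitalization would already count as a change here; normalizing the text before comparison would be a natural refinement.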

Evaluation

Assessing the quality of a response can be the most complex step. If the response has not changed, the previous rating is retained. If a new response is given, this must be more closely evaluated.

The framework's AI components (e.g. NLP) are applied for this purpose. Correctness and understanding are determined by dissecting the response. On the one hand, the response must contain certain trigger words or structures. On the other hand, the form of the response must meet the relevant expectations. We have made steady progress in this area over the last 10 years.
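The idea of checking for trigger words can be shown with a deliberately simple example. It is a stand-in for the NLP mentioned above and not the framework's method: the function name `trigger_coverage` is hypothetical, only single-word triggers are supported, and matching is a plain case-insensitive word comparison.

```python
import re


def trigger_coverage(response: str, triggers: list[str]) -> float:
    """Return the share of expected trigger words found in the response (0.0 to 1.0)."""
    if not triggers:
        raise ValueError("at least one trigger word is required")
    # Split the response into lowercase word tokens, ignoring punctuation.
    words = set(re.findall(r"\w+", response.lower()))
    hits = sum(1 for trigger in triggers if trigger.lower() in words)
    return hits / len(triggers)
```

A coverage of 1.0 would suggest a correct response and 0.0 an incorrect one; values in between are candidates for the partial ratings and for review by an analyst.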

For quality assurance, the responses are always cross-checked and additionally evaluated by an analyst to maximize the reliability of the evaluation process. In the case of complex results that allow for a wide spectrum, the decision is left up to a committee.

Reporting

The results are then presented in a report which includes the hard quantitative numbers (such as how many questions were answered successfully) along with further discussion around the relevant interpretive scope. The benchmarking is accompanied by graphs to help illustrate progress and developments. The IQ (intelligence quotient) used in the field of psychology is applied here. Because of calculation idiosyncrasies, a separate KPI (key performance indicator) was introduced.
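As an illustration of the quantitative part of such a report, the following sketch averages item ratings per A-IQ domain. It does not reproduce the A-IQ or KPI calculation, which the article does not specify; the function name `domain_report` and the input format (pairs of domain ID and performance in percent) are assumptions.

```python
from collections import defaultdict
from statistics import mean


def domain_report(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average item performance (0 to 100) per A-IQ domain ID (DO1 to DO7)."""
    by_domain: dict[str, list[float]] = defaultdict(list)
    for domain, performance in results:
        by_domain[domain].append(performance)
    # Sort by domain ID so that reports from different runs line up.
    return {domain: mean(values) for domain, values in sorted(by_domain.items())}
```

Per-domain averages of this kind make it possible to compare devices or successive test runs domain by domain, for example as the basis for the graphs that accompany the benchmarking.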

The in-depth analysis allows for a close evaluation of various products or the chance to carry out quality testing on an in-house development. Systematic decision making thus promotes a better user experience and, in turn, enables the best possible solution.

Benchmarking of personal assistants with the help of the iA-IQS

Conclusion

It is now indisputable that artificial intelligence is finding its way into our everyday lives. Early implementations in smart devices, such as TVs, sound systems and mobile phones, are becoming more popular all the time, but there must be a means of testing the functionality of these solutions. The iA-IQS tool was developed to standardize the process for this analysis. And thanks to our automation framework, it will be possible to carry out testing on different devices very efficiently and in real time.

About the Authors

Marc Ruef has been working in information security since the late 1990s. He is well known for his many publications and books. His most recent book, The Art of Penetration Testing, discusses security testing in detail. He lectures at several institutions, including ETH, HWZ, HSLU and IKF. (ORCID 0000-0002-1328-6357)

Marisa Tschopp studied Organizational Psychology at the Ludwig-Maximilians-University in Munich. She conducts research on Artificial Intelligence from a humanities perspective, focusing on psychological and ethical aspects. She has given various talks, including at TEDx, and is a board member of the global Women in AI (WAI) initiative. (ORCID 0000-0001-5221-5327)
