Testing of AI-Based Systems

Specification of AI-Based Systems

System requirements and design specifications are as important for AI-based systems as they are for conventional systems. These specifications form the basis for testers to verify that actual system behavior matches the specified requirements. However, if the specifications are incomplete and untestable, this leads to a test oracle problem (see Section 8.7).

There are several reasons why the specification of AI-based systems can be particularly challenging:

  • In many projects for AI-based systems, requirements are specified only in terms of high-level business goals and required predictions. One reason for this is the exploratory nature of AI-based system development: such projects often start with a data set, and the goal is to determine what predictions can be obtained from that data. This contrasts with traditional projects, where the required logic is determined at the outset.
  • The accuracy of the AI-based system is often unknown until the results of independent testing are available. Combined with the exploratory development approach, this often leads to inadequate specifications because the implementation is already underway when the desired acceptance criteria are established.
  • The probabilistic nature of many AI-based systems may require that tolerances be established for some of the expected quality requirements, such as the accuracy of predictions (a minimal sketch of such a tolerance-based check follows this list).
  • If the goals of the system are to emulate human behavior rather than provide specific functions, this often leads to poorly specified behavioral requirements based on the assumption that the system should be as good as or better than the human activities it is intended to replace. This can complicate the definition of a test oracle, especially when the capabilities of the humans it is intended to replace vary widely.
  • When AI is used to implement user interfaces, such as natural language recognition, computer vision, or physical interaction with humans, the systems must have greater flexibility. However, such flexibility can also lead to challenges in identifying and documenting all the different ways in which such interactions can occur.
  • Quality attributes specific to AI-based systems, such as adaptability, flexibility, evolution, and autonomy, must be considered and defined as part of the requirements specification (see Chapter 2). Due to the novelty of these properties, it can be difficult to define and test them.
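As an illustration of the tolerance point above, the following is a minimal Python sketch of an acceptance check that uses a tolerance band rather than an exact value. The threshold, the tolerance, and the per-sample predict() interface are assumptions made for the example, not values from the syllabus.

```python
# Minimal sketch of an acceptance check with a tolerance band.
# EXPECTED_ACCURACY and TOLERANCE are assumed example values.

EXPECTED_ACCURACY = 0.95  # assumed target agreed with stakeholders
TOLERANCE = 0.02          # assumed acceptable deviation for a probabilistic system

def check_accuracy_within_tolerance(model, test_inputs, expected_labels):
    """Return (passed, accuracy) for a simple tolerance-based acceptance check."""
    predictions = [model.predict(x) for x in test_inputs]  # hypothetical interface
    correct = sum(p == y for p, y in zip(predictions, expected_labels))
    accuracy = correct / len(expected_labels)
    return accuracy >= EXPECTED_ACCURACY - TOLERANCE, accuracy
```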

Test Levels for AI-based Systems

AI-based systems typically include both AI and non-AI components. Non-AI components can be tested using traditional approaches, while AI components and systems that include AI components must be tested differently in some respects, as described below. For all levels of testing that involve testing AI components, it is important that testing be closely supported by data engineers/scientists and domain experts.
A key difference from the testing levels for traditional software is the inclusion of two new, specialized testing levels that explicitly address testing of the input data and models used in AI-based systems. Most of this section is applicable to all AI-based systems, although some parts are specific to ML.

Checking the input data

Input data checking is intended to ensure that the data used by the system for training and prediction is of the highest quality (see Section 4.3). It includes the following:

  • Reviews
  • Statistical procedures (e.g., testing the data for bias; a minimal sketch follows this list)
  • Exploratory data analysis (EDA) of the training data
  • Static and dynamic testing of the data pipeline
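As one example of a statistical procedure mentioned above, the following is a minimal sketch of a bias indicator for tabular training data: a chi-squared test of association between a protected group attribute and the label. The column names are hypothetical, and a significant association is only a prompt for further analysis, not proof of bias.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def check_label_bias(df: pd.DataFrame, group_col: str, label_col: str, alpha: float = 0.05):
    """Flag a possible association between a group attribute and the label."""
    table = pd.crosstab(df[group_col], df[label_col])  # contingency table
    _, p_value, _, _ = chi2_contingency(table)
    return {"p_value": p_value, "possible_bias": p_value < alpha}

# Hypothetical usage with assumed column names:
# report = check_label_bias(training_df, group_col="gender", label_col="loan_approved")
```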

The data pipeline typically includes several data preparation components (see Section 4.1), and testing of these components includes both component testing and integration testing.
The data pipeline for training may be significantly different from the data pipeline used to support operational prediction. For training, the data pipeline can be considered a prototype, compared to the fully developed, automated version used in operations. For this reason, testing of these two versions of the data pipeline may be very different. However, testing the functional equivalence of the two versions should also be considered.
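The following is a minimal sketch of such a functional equivalence check, assuming both pipeline versions expose a callable that transforms a raw record into a numeric feature vector; the callables and the tolerance are assumptions for the example.

```python
import numpy as np

def check_pipeline_equivalence(raw_records, prototype_pipeline, operational_pipeline, atol=1e-6):
    """Feed the same raw records through both pipeline versions and compare
    the resulting feature vectors within a small numerical tolerance."""
    mismatches = []
    for record in raw_records:
        proto = np.asarray(prototype_pipeline(record), dtype=float)
        ops = np.asarray(operational_pipeline(record), dtype=float)
        if proto.shape != ops.shape or not np.allclose(proto, ops, atol=atol):
            mismatches.append(record)
    return mismatches  # an empty list means the two versions agreed on this sample
```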

Testing ML Models

The goal of ML model testing is to ensure that the selected model meets any performance criteria that may be specified. These include:

  • ML functional performance criteria (see Sections 5.1 and 5.2).
  • ML non-functional acceptance criteria appropriate to the model itself, such as training speed, prediction speed, computational resources used, adaptability, and transparency.

ML model testing also aims to determine whether the choice of ML framework, algorithm, model, model settings, and hyperparameters is as close to optimal as possible. Where appropriate, ML model testing may also include testing to achieve white-box coverage criteria (see Section 6.2). The selected model is later integrated with other components, AI and non-AI.
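The following is a minimal scikit-learn-based sketch of checking a candidate binary classifier against ML functional performance criteria; the metric choices and thresholds are assumed example values, not values prescribed by the syllabus.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

THRESHOLDS = {"accuracy": 0.90, "precision": 0.85, "recall": 0.85}  # assumed criteria

def check_model_functional_performance(model, X_test, y_test):
    """Evaluate a trained binary classifier against assumed acceptance thresholds."""
    y_pred = model.predict(X_test)
    observed = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    failures = {name: value for name, value in observed.items() if value < THRESHOLDS[name]}
    return observed, failures  # non-empty failures means the criteria were not met
```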

Component Testing

Component testing is a conventional level of testing that applies to all non-model components, such as user interfaces and communication components.

Component Integration Testing

Component integration testing is a conventional level of testing performed to ensure that system components (both AI and non-AI) interact correctly. Testing is performed to ensure that inputs from the data pipeline are received by the model as expected and that all predictions generated by the model are exchanged with and used correctly by relevant system components (e.g., the user interface).
When AI is provided as a service (see Section 1.7), it is common to perform API testing of the provided service as part of component integration testing.
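The following is a minimal sketch of such an API test against a hypothetical prediction endpoint; the URL, request payload, and response fields are assumptions made for illustration, not part of any real service.

```python
import requests

def test_prediction_endpoint():
    # Hypothetical endpoint and request schema for an ML service under test.
    url = "https://api.example.com/v1/predict"
    payload = {"features": [5.1, 3.5, 1.4, 0.2]}

    response = requests.post(url, json=payload, timeout=5)

    # Contract checks: status code, response schema, and value ranges.
    assert response.status_code == 200
    body = response.json()
    assert "prediction" in body and "confidence" in body
    assert 0.0 <= body["confidence"] <= 1.0
```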

System Testing

System testing is a conventional level of testing designed to ensure that the overall system of integrated components (both AI and non-AI) functions as expected from both a functional and non-functional perspective in a test environment that closely resembles the operational environment. Depending on the system, these tests may take the form of field testing in the expected operational environment or in a simulator (e.g., if the test scenarios are hazardous or difficult to reproduce in an operational environment).
During system testing, the AI functional performance criteria are retested to ensure that the results of the earlier tests of the AI model are not invalidated when the model is embedded in the full system. These tests are especially important if the AI component has been intentionally modified (e.g., by compressing a DNN to reduce its size).
System testing is also the level of testing at which many of the non-functional requirements for the system are tested. For example, adversarial tests can be performed to test robustness, and the system can be tested for explainability. Where appropriate, interfaces to hardware components (e.g., sensors) can be tested as part of the system testing.
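The following is a minimal sketch of one simple robustness check in the spirit of adversarial testing: predictions should not change under small random perturbations of the input. It is only a weak, black-box probe; a true adversarial attack (e.g., gradient-based) requires access to the model internals. The prediction interface and the perturbation budget are assumptions for the example.

```python
import numpy as np

def check_prediction_stability(system_predict, x, epsilon=0.01, trials=20, seed=0):
    """Check that predictions for inputs within an epsilon-ball around x
    match the prediction for x itself (a weak, black-box robustness probe)."""
    rng = np.random.default_rng(seed)
    baseline = system_predict(x)  # hypothetical system-level prediction call
    for _ in range(trials):
        perturbed = np.asarray(x, dtype=float) + rng.uniform(-epsilon, epsilon, size=np.shape(x))
        if system_predict(perturbed) != baseline:
            return False  # prediction flipped under a small perturbation
    return True
```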

Acceptance Testing

Acceptance testing is a conventional level of testing used to determine whether the overall system is acceptable to the customer. For AI-based systems, defining acceptance criteria can be challenging (see Section 8.8). If AI is provided as a service (see Section 1.7), acceptance testing may be needed to determine whether the service is suitable for the intended system and whether, for example, the ML functional performance criteria have been sufficiently met.

Test data for testing AI-based systems

Depending on the situation and the system under test (SUT), collecting test data can be challenging. There are several potential challenges in handling test data for AI-based systems, including:

  • Big data (data with high volume, high velocity, and high variety) can be difficult to create and manage. For example, it may be difficult to create representative test data for a system that processes large volumes of image and audio data at high speed.
  • Input data may need to change over time, especially if it represents real-world events.
    For example, captured photos for testing a facial recognition system may need to be “aged” to represent the aging of people over several years in the real world.
  • Personal or otherwise sensitive data may need to be sanitized, encrypted, or redacted using special techniques (a minimal pseudonymization sketch follows this list). Legal permission for its use may also be required.
  • If testers use the same implementation as data scientists for data collection and preprocessing, errors in these steps can be masked.
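The following is a minimal sketch of pseudonymizing personal fields before records are used as test data. The field names are hypothetical, and salted hashing is only one possible technique; it does not by itself guarantee compliance with legal requirements.

```python
import hashlib

SENSITIVE_FIELDS = ("name", "email", "phone")  # assumed field names

def pseudonymize(record: dict, salt: str = "test-data-salt") -> dict:
    """Replace sensitive field values with salted hashes so records remain
    distinguishable for testing without exposing the original personal data."""
    sanitized = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in sanitized:
            digest = hashlib.sha256((salt + str(sanitized[field])).encode()).hexdigest()
            sanitized[field] = digest[:16]
    return sanitized
```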

Testing for Automation Bias in AI-Based Systems

One category of AI-based systems assists humans in decision making. However, humans sometimes place too much trust in these systems. This misplaced trust is referred to as automation bias or complacency bias and takes two forms:

  • The first form of automation bias or complacency bias occurs when a human accepts the system’s recommendations without considering input from other sources (including their own judgment). For example, a process in which a human types data into a form could be improved by using machine learning to pre-populate the form, with the human then validating that data. Adding this form of automation has been shown to typically reduce the quality of the resulting decisions by 5%, although, depending on the system context, this value can be much higher. Automatic correction of typed text (e.g., on cell phones) is also frequently wrong and can change the intended meaning; users often do not notice this and do not override the error.
  • The second form of automation bias or complacency bias occurs when humans miss a system error because they do not adequately monitor the system. For example, semi-autonomous vehicles are becoming increasingly self-driving but still rely on a human to take the wheel in the event of an impending accident. Typically, the human occupant of the vehicle gradually becomes too trusting of the system’s ability to control the vehicle and begins to pay less attention, which can leave them unable to react appropriately when needed.

In both scenarios, testers should understand how human decision making may be affected and test both the quality of system recommendations and the quality of corresponding human input from representative users.

Documenting an AI Component

The typical content for the documentation of an AI component includes the following (a minimal machine-readable sketch follows the list):

  • General: identifier, description, developer details, hardware requirements, license details, version, date, and point of contact.
  • Design: assumptions and technical decisions.
  • Usage: primary and secondary use cases, typical users, self-learning approach, known biases, ethical issues, security issues, transparency, decision thresholds, platform, and concept drift.
  • Datasets: characteristics, collection, availability, pre-processing requirements, usage, content, labeling, size, privacy, security, bias/fairness, and limitations/restrictions.
  • Testing: test data set (description and availability), test independence, test results, test approach for robustness, explainability, concept drift, and transferability.
  • Training and ML functional performance: ML algorithm, weights, validation data set, selection of ML performance metrics, thresholds for ML performance metrics, and actual ML performance metrics.
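The same content can also be kept in a machine-readable form alongside the component. The following is a minimal sketch of such a record as a Python dictionary; all values are illustrative placeholders.

```python
# Minimal sketch of a machine-readable documentation record for an AI component.
# All values are illustrative placeholders.
ai_component_doc = {
    "general": {"identifier": "credit-risk-model", "version": "1.2.0",
                "date": "2024-01-01", "contact": "ml-team@example.com"},
    "design": {"assumptions": ["tabular input", "monthly retraining"]},
    "usage": {"primary_use_case": "loan default prediction",
              "known_biases": ["underrepresents younger applicants"]},
    "datasets": {"size": 120000, "labeling": "manual", "privacy": "pseudonymized"},
    "testing": {"test_data_set": "holdout-2023Q4", "independence": "separate test team"},
    "training_and_performance": {"algorithm": "gradient boosting",
                                 "metrics": {"accuracy": 0.91, "recall": 0.87}},
}
```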

Clear documentation helps improve testing by providing transparency about the implementation of the AI-based system. The key areas of documentation that are important for testing are:

  • The purpose of the system and the specification of functional and non-functional requirements. These documents usually form part of the test basis.
  • Architecture and design information that shows how the various AI and non-AI components interact. This supports the identification of integration test objectives and can provide a basis for white-box testing of the system structure.
  • Specification of the operating environment. This is required to test the autonomy, flexibility, and adaptability of the system.
  • The source of all input data, including associated metadata. This must be clearly understood when testing the following aspects:
    • Functional correctness of untrusted inputs.
    • Explicit or implicit sampling bias.
    • Flexibility, including mislearning due to bad data inputs in self-learning systems.
  • The manner in which the system is expected to adapt to changes in its operational environment. This is needed as a test basis for testing adaptability.
  • Details of the expected system users. This is required to ensure that testing is representative.

Testing for Concept Drift

The deployment environment may change over time without a corresponding change in the trained model. This phenomenon is referred to as concept drift and usually causes the results of the model to become increasingly inaccurate and less useful. For example, the impact of marketing campaigns can cause the behavior of potential customers to change over time. Such changes may be seasonal or abrupt due to cultural, moral, or societal changes that are external to the system. An example of such an abrupt change is the impact of the COVID-19 pandemic and its implications for the accuracy of models used for revenue forecasting and stock markets.
Systems that are susceptible to concept drift should be tested periodically against the agreed ML functional performance criteria so that the onset of concept drift is detected early enough for the problem to be mitigated. Typical remedies include decommissioning the system or retraining it. Retraining would be conducted with current training data and followed by confirmation testing, regression testing, and possibly some form of A/B testing (see Section 9.4), in which the updated B system must outperform the original A system.
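The following is a minimal sketch of periodic re-testing for concept drift and of a simple A/B comparison after retraining, assuming accuracy is the agreed ML functional performance criterion; the threshold and the scikit-learn-style model interface are assumptions for the example.

```python
from sklearn.metrics import accuracy_score

AGREED_MINIMUM_ACCURACY = 0.88  # assumed agreed ML functional performance criterion

def check_for_concept_drift(model, recent_inputs, recent_labels):
    """Re-evaluate the model on recent, labeled operational data and flag
    possible concept drift when performance falls below the agreed minimum."""
    accuracy = accuracy_score(recent_labels, model.predict(recent_inputs))
    return {"accuracy": accuracy, "drift_suspected": accuracy < AGREED_MINIMUM_ACCURACY}

def ab_check(original_model, retrained_model, test_inputs, test_labels):
    """Simple A/B comparison after retraining: the updated B model must
    outperform the original A model on the same evaluation data."""
    acc_a = accuracy_score(test_labels, original_model.predict(test_inputs))
    acc_b = accuracy_score(test_labels, retrained_model.predict(test_inputs))
    return acc_b > acc_a, {"A": acc_a, "B": acc_b}
```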

Selecting a testing approach for an ML system

An AI-based system typically contains both AI and non-AI components. The testing approach is based on a risk analysis for such a system and includes both conventional tests and more specialized tests to address factors specific to AI components and AI-based systems.
The following list includes some typical risks and corresponding mitigations that are specific to ML systems.
Note that this list contains only a limited number of examples and that there are many other risks specific to ML systems that need to be mitigated through testing.

  • Risk: The data quality may be lower than expected. This risk can become a problem in several ways, each of which is mitigated in a different way (see Section 4.4).
    Mitigation: Common remedies include the use of reviews, EDA, and dynamic testing.
  • Risk: The operational data pipeline may be faulty.
    Mitigation: This risk can be partially mitigated by dynamic testing of the individual pipeline components and integration testing of the entire pipeline.
  • Risk: The ML workflow used to develop the model may be suboptimal (see Section 3.2). This risk could be due to:
    • A lack of prior agreement on the ML workflow to be followed
    • A poor choice of workflow
    • Data engineers not following the workflow
    Mitigation: Reviews by experts can mitigate the risk of choosing the wrong workflow, while more hands-on management or audits could address issues with workflow agreement and adherence.
  • Risk: The choice of ML framework, algorithm, model, model settings, and/or hyperparameters may be suboptimal. This could be due to a lack of expertise among decision makers or poor implementation of the evaluation and tuning steps (or test steps) of the ML workflow.
    Mitigation: Reviews by experts can reduce the risk of wrong decisions, and better management can ensure that the evaluation and tuning steps (and test steps) of the workflow are followed.
  • Risk: The desired ML functional performance criteria may not be met by the operational system even though the ML model meets these criteria on its own. This could be because the data sets used for training and testing the model are not representative of the data encountered in operation.
    Mitigation: Expert (or user) review of the selected data sets may mitigate the risk that the selected data are not representative.
  • Risk: The desired ML functional performance criteria are met, but users may be dissatisfied with the results delivered. This could be due to the selection of the wrong performance metrics (e.g., high recall was chosen when high precision was required), or to concept drift.
    Mitigation: Reviews with experts can reduce the risk of selecting the wrong ML functional performance metrics, and experience-based testing can also identify inappropriate criteria. If the cause is concept drift, more frequent testing of the operational system can mitigate the risk.
  • Risk: The desired ML functional performance criteria are met, but users may be dissatisfied with the service provided. This could be due to a lack of focus on the non-functional requirements of the system. Note that the range of quality attributes for AI-based systems goes beyond those listed in ISO/IEC 25010 (see Chapter 2).
    Mitigation: Using a risk-based approach to prioritize the quality attributes and performing the appropriate non-functional tests could mitigate this risk. Alternatively, the problem could be due to a combination of factors that could be identified through experience-based testing as part of system testing. Chapter 8 provides guidance on testing these characteristics.
  • Risk: The self-learning system may not perform as expected by users. This risk can have several causes, for example:
    • The data that the system uses for self-learning may be inappropriate. In this case, reviews by experts could identify the problematic data.
    • New self-learned features may not be acceptable and could cause the system to fail. This could be mitigated by automated regression testing, including performance comparison with the previous functionality.
    • The system may learn in ways that users do not expect, which could be uncovered through experience-based testing.
  • Risk: Users may be frustrated because they do not understand how the system makes its decisions. This could be due to a lack of explainability, interpretability, and/or transparency.
    Mitigation: See Section 8.6 for details on testing for these characteristics.
  • Risk: The model may give excellent predictions when the data is similar to the training data, but poor results otherwise. This may be due to overfitting (see Section 3.5.1).
    Mitigation: Overfitting can be uncovered by testing the model with a data set that is completely independent of the training data set, or by experience-based testing (a minimal sketch of such a check follows this list).
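Related to the last risk in the list, the following is a minimal sketch of an overfitting check that compares performance on the training data with performance on a fully independent data set; the gap threshold and the scikit-learn-style model interface are assumptions for the example.

```python
from sklearn.metrics import accuracy_score

MAX_ACCEPTABLE_GAP = 0.05  # assumed threshold for the train/independent-set gap

def check_overfitting(model, X_train, y_train, X_independent, y_independent):
    """Flag possible overfitting when training accuracy greatly exceeds
    accuracy on a data set that is independent of the training data."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    independent_acc = accuracy_score(y_independent, model.predict(X_independent))
    gap = train_acc - independent_acc
    return {"train": train_acc, "independent": independent_acc,
            "overfitting_suspected": gap > MAX_ACCEPTABLE_GAP}
```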

Source: ISTQB®: Certified Tester AI Testing (CT-AI) Syllabus Version 1.0
