Testing of AI-Specific Quality Characteristics

Challenges in Testing Self-Learning Systems

There are several potential challenges to overcome when testing self-learning systems (see Chapter 2 for more details on these systems), including:

  • Unexpected changes: The original requirements and constraints within which the system should operate are generally known, but there may be little or no information about changes the system makes to itself. Testing against the original requirements and design (and any specified constraints) is usually possible, but if the system has developed an innovative implementation or gamed a solution (whose implementation is not visible), it may be difficult to design tests appropriate for this new implementation. As the system changes itself (and its results), the results of previously passed tests may also change. This is a challenge for test design. It can be addressed by designing tests that remain relevant as the system’s behavior changes, thereby avoiding problems in regression testing. However, it may also require designing new tests based on the observed new system behavior.
  • Complex acceptance criteria: It may be necessary to define expectations for improving the system as it learns itself. For example, it may be expected that the overall performance of the system should improve as it changes itself. In addition, defining something other than a simple “improvement” can quickly become complex. For example, a minimum improvement may be expected (rather than just any improvement), or the required improvement may be linked to environmental factors (e.g., a minimum 10% improvement in functionality X is required if the environmental factor F changes by more than Y). These issues can be addressed by specifying and testing against the more complex acceptance criteria and by continuously recording the current baseline functional performance of the system.
  • Insufficient test time: It may be necessary to know how quickly the system should learn and adapt in different scenarios. These acceptance criteria can be difficult to specify and verify. If a system adapts quickly, there may not be enough time to run new tests manually after each change, so it may be necessary to write tests that run automatically whenever the system changes itself. These challenges can be overcome by specifying appropriate acceptance criteria (see Section 8.8) and by automated continuous testing.
  • Resource requirements: System requirements may include acceptance criteria for the resources the system may use when performing self-learning or adaptation, for example the amount of processing time and memory that may be consumed for an improvement. In addition, consider whether this resource use should be tied to a measurable improvement in functionality or accuracy. This challenge affects the definition of acceptance criteria.
  • Insufficient specification of the operating environment: A self-learning system may change if the inputs it receives from the environment are outside the expected ranges or are not represented in the training data. Such inputs can also be attacks in the form of data poisoning (see Section 9.1.2). It can be difficult to predict the full range of operating environments and environmental changes, and therefore to identify the full set of representative test cases and environmental requirements. Ideally, the full range of possible changes in the operating environment to which the system is expected to respond is defined as acceptance criteria.
  • Complex test environment: Managing the test environment to ensure that it can mimic all potential high-risk changes to the operating environment is challenging and may require the use of test tools (e.g., a fault injection tool). Depending on the nature of the operating environment, this may be tested by manipulating inputs and sensors or by gaining access to various physical environments in which the system can be tested.
  • Unwanted behavior changes: A self-learning system will change its behavior based on its inputs, and testers may not be able to prevent this from happening. This may be the case, for example, when using a third-party system or when testing the production system. By repeating the same tests, a self-learning system can respond more effectively to these tests, which can then affect the long-term behavior of the system. It is therefore important to prevent the tests from negatively affecting the behavior of a self-learning system. This is a challenge for test case design and test management.
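The complex acceptance criteria described above (e.g., a minimum improvement required only when an environmental factor changes beyond a threshold) can be encoded as an automated check against a recorded baseline. A minimal sketch; the function name, threshold values, and measurements below are hypothetical:

```python
def improvement_acceptable(baseline: float, current: float,
                           env_change: float, env_threshold: float = 0.0,
                           min_improvement: float = 0.10) -> bool:
    """Conditional acceptance criterion: if the environmental factor
    changed by more than the threshold, require at least a minimum
    relative improvement over the recorded baseline performance."""
    if env_change <= env_threshold:
        return current >= baseline          # no regression allowed
    return current >= baseline * (1 + min_improvement)

# Environmental factor changed beyond the threshold, so a 10%
# improvement over the baseline of 0.80 (i.e., >= 0.88) is required.
print(improvement_acceptable(0.80, 0.90, env_change=0.5, env_threshold=0.2))  # True
print(improvement_acceptable(0.80, 0.85, env_change=0.5, env_threshold=0.2))  # False
```

Recording the current baseline after each accepted change keeps such a check meaningful as the system continues to learn.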

Testing autonomous AI-based systems

Autonomous systems must be able to determine when they need human intervention and when they do not. Testing for autonomy of AI-based systems therefore requires creating conditions under which the system can perform this decision making.

Testing for autonomy may require the following:

  • Testing whether the system requires human intervention for a given scenario when the system should relinquish control. Such scenarios could include a change in the operating environment or the system exceeding autonomy limits.
  • Checking whether the system requests human intervention when the system should relinquish control after a specified period of time.
  • Testing whether the system is unnecessarily requesting human intervention when it should still be operating autonomously.

It may be helpful to apply boundary value analysis to the operating environment to create the necessary conditions for these tests. It can be challenging to define how the parameters that determine autonomy manifest themselves in the operating environment, and to create test scenarios that depend on the type of autonomy.
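Boundary analysis of the autonomy envelope can be turned into simple scenario tests. A minimal sketch, assuming a hypothetical handover decision function `should_request_intervention` and an assumed autonomy limit:

```python
AUTONOMY_LIMIT = 100.0  # assumed maximum sensor reading for autonomous operation

def should_request_intervention(sensor_value: float) -> bool:
    """Hypothetical stand-in for the system's handover decision."""
    return sensor_value > AUTONOMY_LIMIT

# Boundary-value scenarios around the autonomy limit: just inside the
# envelope the system must stay autonomous (no unnecessary handover);
# just beyond it the system must request human intervention.
assert should_request_intervention(AUTONOMY_LIMIT - 0.1) is False   # still autonomous
assert should_request_intervention(AUTONOMY_LIMIT + 0.1) is True    # must hand over
print("autonomy boundary checks passed")
```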

Testing for algorithmic, sampling, and inappropriate biases

An ML system should be evaluated with respect to the various forms of bias and to any measures taken to eliminate inappropriate bias. This may mean intentionally introducing a countervailing bias to offset the inappropriate one.

Testing with an independent data set can often detect bias. However, it can be difficult to identify all the data causing bias because the ML algorithm can use combinations of seemingly unrelated features to create undesirable bias.

AI-based systems should be tested for algorithmic bias, sampling bias, and inappropriate bias (see Section 2.4). This may include:

  • Analysis during the training, evaluation, and optimization activities of the model to determine if algorithmic bias is present.
  • Review of the source of the training data and the procedures used to obtain it to detect the presence of sampling bias.
  • Reviewing the preprocessing of data as part of the ML workflow to determine if the data has been influenced in a way that could cause sampling bias.
  • Measuring how changes in system inputs affect system outputs over a large number of interactions, and examining the results based on the groups of people or objects that the system may be inappropriately biased toward or against. This is similar to the Local Interpretable Model-Agnostic Explanations (LIME) method described in Section 8.6 and can be performed in a production environment as well as in pre-release testing.
  • Obtaining additional information about the attributes of the input data that may be related to the bias and correlating them with the results. For example, this could refer to demographic data that might be appropriate when testing for inappropriate biases involving groups of people, when membership in a group is relevant to assessing the bias but is not an input to the model.
    This is because the bias may be based on “hidden” variables that are not explicitly present in the input data but are inferred by the algorithm.
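Measuring how outputs vary across groups of people or objects, as described above, can be sketched as a simple demographic-parity check over recorded interactions. The data, group labels, and disparity threshold below are hypothetical:

```python
from collections import defaultdict

def positive_rates(records):
    """Compute the rate of positive outcomes per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, outcome in records:
        counts[group][0] += outcome
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

# Hypothetical model decisions: (group label, 1 = positive outcome).
records = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
           ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
rates = positive_rates(records)
disparity = max(rates.values()) - min(rates.values())
print(rates)        # {'A': 0.75, 'B': 0.25}
print(disparity)    # 0.5 -- flag if above an agreed fairness threshold
```

Group membership here may come from the additional correlated attributes mentioned above, since it is often not an input to the model itself.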

Challenges in testing probabilistic and non-deterministic AI-based systems

Most probabilistic systems are also non-deterministic, so the following testing challenges typically apply to AI-based systems with either of these attributes:

  • There can be multiple valid outcomes of a test with the same set of preconditions and inputs. This makes defining expected results more difficult and can lead to problems:
    • when tests are reused for confirmation tests.
    • when tests are reused for regression testing.
    • when reproducibility of tests is important.
    • when tests are automated.
  • The tester usually needs deeper knowledge of the required system behavior in order to define meaningful pass criteria, rather than simply specifying an exact value as the expected result. Testers may therefore need to define more sophisticated expected results than for conventional systems; these may include tolerances (e.g., “is the actual result within 2% of the optimal solution?”).
  • If a single definitive test result is not possible due to the probabilistic nature of the system, the tester must often run a test multiple times to obtain a statistically valid test result.
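Running a test repeatedly and applying a statistical pass criterion, as described above, might look like the following sketch. The non-deterministic system under test is simulated here, and the 2% tolerance is taken from the example above:

```python
import random
import statistics

def system_under_test(x: float) -> float:
    """Stand-in for a non-deterministic system: the 'true' answer
    plus small random noise."""
    return 2 * x + random.gauss(0, 0.05)

random.seed(42)  # fix the seed so this sketch is reproducible
runs = [system_under_test(10.0) for _ in range(100)]
mean = statistics.mean(runs)

# Pass criterion: the mean over many runs must lie within 2% of the
# expected value, rather than each single run matching it exactly.
expected = 20.0
assert abs(mean - expected) / expected < 0.02
print(f"mean over 100 runs: {mean:.3f}")
```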

Challenges in testing complex AI-based systems

AI-based systems are often used to implement tasks that are too complex for humans. This can lead to a test oracle problem, as testers are not able to determine the expected results as usual (see Section 8.7). For example, AI-based systems are often used to detect patterns in large amounts of data. Such systems are used because they can find patterns that humans cannot find manually, even after intensive analysis. It can be challenging to understand the required behavior of such systems in sufficient depth to achieve the expected results.

A similar problem arises when the internal structure of an AI-based system has been generated by software and is too complex for humans to understand. As a result, the AI-based system can effectively only be tested as a black box: even where the internal structure is visible, it provides no additional useful information to help with testing.

The complexity of AI-based systems increases when they produce probabilistic results and are non-deterministic (see Section 8.4).

The problems with non-deterministic systems are exacerbated when an AI-based system consists of multiple interacting components that each produce probabilistic results. For example, a face recognition system is likely to use one model to detect a face in an image and a second model to identify whose face it is. The interactions between AI components can be complex and difficult to understand, making it difficult to identify all risks and to develop tests that adequately validate the system.

Testing the Transparency, Interpretability, and Explainability of AI-Based Systems

Information about how the system was implemented can be provided by system developers. This includes the sources of the training data, how the labeling was done, and how the system components were designed. If this information is not available, it can complicate test development. For example, if information about training data is not available, it becomes difficult to identify potential gaps in that data and to test the effects of those gaps. This situation can be compared to black-box and white-box testing and has similar advantages and disadvantages. Transparency can be tested by comparing documented information about the data and algorithm to the actual implementation and determining how well they match.

With ML, it can often be more difficult to explain the relationship between a particular input and a particular output than with traditional systems. This low explainability is primarily due to the fact that the model that generates the output is itself generated by code (the algorithm) and does not reflect the way humans think about a problem. Different ML models provide different degrees of explainability and should be selected based on the requirements of the system, which may include explainability and testability.

One method for assessing explainability is to dynamically test the ML model while applying perturbations to the test data. There are methods that quantify and visually convey explainability in this way. Some of these methods are model-independent, while others are specific to a particular model type and require access to it. Exploratory testing can also be used to better understand the relationship between the inputs and outputs of a model.

The LIME method is model-independent and uses dynamically injected input perturbations and analysis of outputs to give testers insight into the relationship between inputs and outputs. This can be an effective method to explain the model. However, it is limited to providing possible reasons for the results rather than a definitive reason and is not appropriate for all types of algorithms.
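A simplified, model-agnostic perturbation analysis in the spirit of LIME (not the full LIME algorithm, which fits a local surrogate model) can be sketched as follows. The black-box model here is a hypothetical linear scorer:

```python
def model(features):
    """Hypothetical black-box model: weighted sum of features."""
    weights = [0.7, 0.1, -0.4]
    return sum(w * f for w, f in zip(weights, features))

def sensitivity(features, delta=1.0):
    """Perturb each input feature in turn and record how much the
    output moves -- a rough, local indication of feature influence."""
    base = model(features)
    scores = []
    for i in range(len(features)):
        perturbed = list(features)
        perturbed[i] += delta
        scores.append(model(perturbed) - base)
    return scores

print([round(s, 6) for s in sensitivity([1.0, 2.0, 3.0])])  # [0.7, 0.1, -0.4]
```

For this linear model the sensitivities recover the weights exactly; for a real ML model they only indicate possible local influences, matching the caveat above that such methods suggest rather than prove reasons for a result.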

The interpretability of an AI-based system depends heavily on who it applies to. Different stakeholders may have different requirements in terms of how well they need to understand the underlying technology.

Measuring and testing understanding, for both interpretability and explainability, can be challenging because stakeholders have different skill sets and may disagree. In addition, for many types of systems it can be difficult to identify the profile of typical stakeholders. Where such assessments are conducted, they usually take the form of user surveys and/or questionnaires.

Test oracle for AI-based systems

A major problem in testing AI-based systems can be specifying the expected results. The source used to determine the expected result of a test is called a test oracle, and the difficulty of obtaining one is known as the test oracle problem.

For complex, non-deterministic or probabilistic systems, it can be difficult to create a test oracle without knowing the “ground truth” (i.e., the actual real-world outcome that the AI-based system is trying to predict). This “ground truth” differs from a test oracle in that a test oracle does not necessarily provide an expected value, but only a mechanism for determining whether or not the system is working correctly.

AI-based systems can evolve (see Section 2.3), and testing self-learning systems (see Section 8.1) can also suffer from test oracle problems because they can change themselves, requiring frequent updates of the system’s functional expectations.

Another cause of difficulty in obtaining an effective test oracle is that the correctness of software behavior is subjective in many cases. Virtual assistants (e.g., Siri and Alexa) are an example of this problem, as different users often have very different expectations and may get different results depending on their choice of words and clarity of speech.

In some situations, it is possible to define the expected outcome with bounds or tolerances. For example, the stopping point for an autonomous car could be defined as being within a maximum distance from a given point. In the context of expert systems, the determination of expected outcomes can be done by consulting an expert (noting that the expert’s opinion may be wrong). In this context, there are several important factors to consider:

  • Human experts vary in competence. The experts involved must be at least as competent as the experts the system is intended to replace.
  • Experts may not agree with each other even when presented with the same information.
  • Human experts may not approve of automating their judgment. In such cases, their evaluation of potential outcomes should be double-blind (i.e., neither the experts nor the evaluators of the outcomes should know which evaluations have been automated).
  • People are more prone to add caveats to their answers (e.g., phrases like “I’m not sure, but…”). If this type of caveat is not available to the AI-based system, this should be taken into account when comparing responses.
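Expected outcomes defined with bounds or tolerances, such as the stopping-point example above, reduce to simple range checks. A minimal sketch; the distances below are hypothetical:

```python
def within_bound(actual_stop: float, target_stop: float,
                 max_distance: float) -> bool:
    """Pass if the vehicle stopped within the allowed distance of the
    target point (a bounded expected outcome, not an exact one)."""
    return abs(actual_stop - target_stop) <= max_distance

print(within_bound(actual_stop=101.3, target_stop=100.0, max_distance=2.0))  # True
print(within_bound(actual_stop=103.1, target_stop=100.0, max_distance=2.0))  # False
```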

There are testing techniques that can mitigate the problem of test oracle, such as A/B testing (see Section 9.4), back-to-back testing (see Section 9.3), and metamorphic testing (see Section 9.5).
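Metamorphic testing, mentioned above, sidesteps the oracle problem by checking a relation between outputs instead of an exact expected value. A sketch using the relation “reordering the input must not change the result”, with a hypothetical system under test:

```python
import random

def top_feature(scores: dict) -> str:
    """Hypothetical system under test: return the highest-scoring key.
    No oracle tells us *which* key is correct for arbitrary data."""
    return max(scores, key=scores.get)

# Metamorphic relation: the result must be invariant under input
# reordering, even though we cannot state the expected value itself.
source = {"cat": 0.2, "dog": 0.7, "bird": 0.1}
items = list(source.items())
random.shuffle(items)
followup = dict(items)

assert top_feature(source) == top_feature(followup)
print("metamorphic relation holds:", top_feature(source))
```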

Test objectives and acceptance criteria

Test objectives and acceptance criteria for a system must be based on perceived product risks.
These risks can often be identified by analyzing the required quality characteristics. Quality attributes for an AI-based system include those traditionally considered in ISO/IEC 25010 (i.e., functional suitability, performance efficiency, compatibility, usability, reliability, security, maintainability, and portability), but the following should also be considered:

Adaptability
  • Check that the system still functions correctly and meets its non-functional requirements when it adapts to a change in its environment. This can be done as a form of automated regression testing.
  • Check the time required for the system to adapt to a change in its environment.
  • Check the resources consumed when the system adapts to a change in its environment.
Flexibility
  • Check how the system behaves when used in a context outside its original specification. This can take the form of automated regression tests performed in the changed operating environment.
  • Check the time the system takes and/or the resources it needs to adapt to a new context.
Evolution
  • Check how well the system learns from its own experience.
  • Check how well the system copes when the profile of the data changes (i.e., concept drift).
Autonomy
  • Check how the system reacts when it is forced outside the operating envelope in which it should be fully autonomous.
  • Check whether the system can be “persuaded” to request human intervention when it should be fully autonomous.
Transparency, interpretability and explainability
  • Check for transparency by checking the accessibility of the algorithm and dataset.
  • Check interpretability and explainability by interviewing system users or, if actual system users are not available, people with a similar background.
Freedom from inappropriate bias
  • For systems where bias is likely, check for it by conducting an independent, unbiased series of tests or by using experts.
  • Compare test results with external data, such as census data, to check for unwanted bias in derived variables (external validity tests).
  • Check the system against an appropriate checklist, such as the EC Assessment List for Trustworthy AI, which supports the key requirements set out in the Ethics Guidelines for Trustworthy AI.
Probabilistic and non-deterministic systems
  • These cannot be evaluated against exact acceptance criteria; even when working correctly, the system may give slightly different results in the same tests.
Side effects
  • Identify potentially harmful side effects and try to generate tests that make the system exhibit them.
Reward hacking
  • Independent tests can detect reward hacking if those tests use a different means of measuring success than the intelligent agent being tested.
Safety
  • Safety must be checked carefully, perhaps in a virtual test environment (see Section 10.2). This could include attempts to force the system to harm itself.

For ML systems, the required functional ML performance metrics should be specified for the ML model (see Chapter 5).
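Checking required functional ML performance metrics against agreed acceptance thresholds, as noted above, can be sketched as follows. The labels, predictions, and thresholds are hypothetical:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75

# Acceptance criteria agreed for this (hypothetical) model.
assert p >= 0.7 and r >= 0.7
```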

Source: ISTQB®: Certified Tester AI Testing (CT-AI) Syllabus Version 1.0
