For performance-related risks to product quality, the process is as follows:
- Identify risks to product quality, focusing on characteristics such as timing, resource utilization, and capacity.
- Evaluate the identified risks, ensuring that the relevant architectural categories are addressed. Assess the overall level of risk for each identified risk in terms of likelihood and impact, using clearly defined criteria.
- Take appropriate mitigation actions for each risk factor based on the nature of the risk factor and the level of risk.
- Continuously manage risk to ensure that risks are adequately mitigated prior to release.
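The assessment step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the 1 to 5 scales, the multiplication of likelihood by impact, the thresholds in `classify`, and the example risks are all assumptions chosen for the sketch, and a real project would define its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceRisk:
    """One identified performance risk (hypothetical structure)."""
    description: str
    likelihood: int  # assumed scale: 1 (very unlikely) to 5 (very likely)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    def level(self) -> int:
        # Assumed scoring rule: overall risk level = likelihood x impact.
        return self.likelihood * self.impact


def classify(risk: PerformanceRisk, high: int = 15, medium: int = 6) -> str:
    """Map a risk level to a category using assumed thresholds."""
    score = risk.level()
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Made-up example risks covering timing, capacity, and resource utilization.
risks = [
    PerformanceRisk("Checkout response time degrades under peak load", 4, 5),
    PerformanceRisk("Nightly batch exceeds its processing window", 2, 4),
    PerformanceRisk("Report export is slow for very large accounts", 2, 2),
]

# Rank the risks so that mitigation effort goes to the highest levels first.
ranked = sorted(risks, key=PerformanceRisk.level, reverse=True)

for risk in ranked:
    print(f"{classify(risk):6} {risk.level():2d}  {risk.description}")
```

Re-running such a ranking whenever requirements, architecture, or design change is one simple way to keep the assessment current throughout the project.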
As with quality risk analysis in general, participants in this process should include both business and technical stakeholders. When analyzing performance-related risks, business stakeholders must include those with a particular awareness of how performance problems in production will actually affect customers, users, the business, and other downstream stakeholders. Business stakeholders must understand that factors such as intended use, business, societal, or safety criticality, potential financial and/or reputational damage, and civil or criminal liability affect risk from a business perspective: they create risks and influence the impact of failures.
In addition, technical stakeholders must include those with a deep understanding of the performance implications of relevant requirements, architecture, design, and implementation decisions. Technical stakeholders must understand that architecture, design, and implementation decisions affect performance risk from a technical perspective: they create risks and influence the likelihood of defects.
The risk analysis process selected should have an appropriate level of formality and rigor. For performance-related risks, it is particularly important that the risk analysis process be started early and repeated regularly. In other words, the tester should avoid relying solely on performance tests performed toward the end of the system test and system integration test levels. Many projects, especially larger and more complex systems projects, have experienced unpleasant surprises because performance deficiencies resulting from requirements, design, architecture, and implementation decisions made early in the project were discovered too late. The focus should therefore be on an iterative approach to identifying, assessing, mitigating, and managing performance risks throughout the software development lifecycle.
For example, when processing large amounts of data through a relational database, the slow performance of many-to-many joins due to poor database design may not become apparent until dynamic testing with large test data sets, such as those used in system testing. However, a careful technical review involving experienced database engineers can predict the problems before the database is implemented. After such a review, risks are re-identified and re-evaluated in an iterative approach.
In addition, risk mitigation and management must encompass and influence the entire software development process, not just dynamic testing. For example, if critical performance-related figures such as the expected number of transactions or concurrent users cannot be determined early in the project, it is important that design and architecture decisions allow for highly variable scalability (e.g., cloud-based compute resources on demand). This enables risk mitigation decisions to be made early.
Good performance engineering can help project teams avoid late discovery of critical performance defects during higher levels of testing, such as system integration testing or user acceptance testing. Performance defects discovered late in a project can be extremely costly and even lead to the cancellation of entire projects.
As with any type of quality risk, performance-related risks can never be completely avoided, i.e., some risk of performance-related production failures will always exist. Therefore, the risk management process must include a realistic and specific assessment of the residual risk for the business and technical stakeholders involved in the process. For example, it is not helpful to simply say, “Yes, it is still possible that customers will experience long wait times at checkout,” as this indicates neither the extent to which the risk has already been mitigated nor the level of residual risk. Instead, a clear statement of the percentage of customers likely to experience delays that meet or exceed a certain threshold helps stakeholders understand the status.
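As a small illustration of such a statement, the following Python sketch computes the share of measured response times that meet or exceed a threshold. The function name `share_exceeding`, the sample measurements, and the 3-second threshold are assumptions made up for this example; real figures would come from performance test results or production monitoring.

```python
def share_exceeding(response_times_s: list[float], threshold_s: float) -> float:
    """Return the percentage of measurements that meet or exceed the threshold."""
    if not response_times_s:
        raise ValueError("at least one measurement is required")
    slow = sum(1 for t in response_times_s if t >= threshold_s)
    return 100.0 * slow / len(response_times_s)


# Made-up checkout response times in seconds, for illustration only.
checkout_times_s = [1.2, 0.8, 3.5, 2.1, 0.9, 4.2, 1.1, 1.7, 0.6, 2.9]

# Assumed threshold: a checkout taking 3 seconds or more counts as a delay.
share = share_exceeding(checkout_times_s, threshold_s=3.0)
print(f"{share:.1f}% of checkouts took 3.0 s or longer")
```

A figure of this kind, tracked across test cycles, shows how far the risk has been reduced and how much residual risk remains.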