Confusion Matrix
In a classification problem, a model will rarely predict all results correctly. For such a problem, a confusion matrix can be created with the following possibilities:
Figure 2: Confusion Matrix
Note that the confusion matrix shown in Figure 2 can be represented in different ways, but always produces values for the four possible situations of true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
Based on the confusion matrix, the following metrics are defined:
- Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN) * 100%
Accuracy measures the percentage of all correct classifications.
- Precision
Precision = TP / (TP + FP) * 100%
Precision measures the percentage of predicted positive results that were actually positive. It is a measure of how confident one can be in positive predictions.
- Recall
Recall = TP / (TP + FN) * 100%
Recall (also called sensitivity) measures the proportion of actual positive results that were correctly predicted. It is a measure of how confident one can be of not missing positive results.
- F1 score
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
The F1-Score is calculated as the harmonic mean of Precision and Recall. It has a value between zero and one. A value close to one indicates that both precision and recall are high, so misclassifications have only a small effect on the result. A low F1 score indicates that the model performs poorly in detecting positive data.
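To make these definitions concrete, the following is a minimal sketch, assuming Python with scikit-learn (a library choice not prescribed by the syllabus) and illustrative example labels. It derives the four confusion-matrix counts and computes the metrics defined above both directly from the counts and via the library functions.

```python
# Minimal sketch: confusion-matrix counts and the derived metrics.
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,
)

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels ordered 0, 1 confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

# The same metrics computed from the counts and via the library functions.
print("Accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("Precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("Recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```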
Additional ML functional performance metrics for classification, regression, and clustering
There are numerous metrics for different types of ML problems (in addition to the metrics for classification described in Section 5.1). Some of the most commonly used metrics are described below.
Supervised Classification Metrics
- The Receiver Operating Characteristic (ROC) curve is a graphical representation that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The method was originally developed for military radar, which is why it is so named. The ROC curve is plotted with the true positive rate (TPR) (also called recall) against the false positive rate (FPR = FP / (TN + FP)), with TPR on the y-axis and FPR on the x-axis.
- The area under the curve (AUC) is the area under the ROC curve. It represents the degree of separability of a classifier and shows how well the model distinguishes between classes. The higher the AUC, the better the predictions of the model.
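As an illustration, the sketch below (again assuming Python with scikit-learn, using an illustrative synthetic dataset and a logistic regression classifier) computes the points of the ROC curve and the corresponding AUC from predicted class probabilities.

```python
# Minimal sketch: ROC curve points and AUC for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# fpr goes on the x-axis, tpr on the y-axis; thresholds are the varied cut-offs.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```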
Supervised Regression Metrics
- For supervised regression models, the metrics represent how well the regression line matches the actual data points.
- The mean squared error (MSE) is the average of the squared differences between the actual value and the predicted value. The value of the MSE is always positive, and a value closer to zero indicates a better regression model. Squaring the difference ensures that positive and negative errors do not cancel each other out.
- R-squared (also known as the coefficient of determination) is a measure of how much of the variation in the dependent variable is explained by the regression model. A value closer to one indicates a better fit.
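The sketch below, assuming scikit-learn and a synthetic regression dataset, computes MSE and R-squared for a fitted linear regression model.

```python
# Minimal sketch: MSE and R-squared for a supervised regression model.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data with some noise (illustrative only).
X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE      :", mean_squared_error(y_test, y_pred))  # closer to zero is better
print("R-squared:", r2_score(y_test, y_pred))             # closer to one is better
```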
Unsupervised Clustering Metrics
For unsupervised clustering, there are several metrics that represent the distances between different clusters and the proximity of data points within a given cluster.
- Intra-cluster metrics measure the similarity of data points within a cluster.
- Inter-cluster metrics measure the similarity of data points in different clusters.
- The silhouette coefficient (also known as the silhouette score) is a measure (between -1 and +1) based on the average distances between and within clusters. A value of +1 means that the clusters are well separated, a value of zero means random clustering, and a value of -1 means that the clusters are mismatched.
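The following sketch, assuming scikit-learn and synthetic blob data, clusters the data with k-means for a few illustrative values of k and reports the silhouette coefficient for each.

```python
# Minimal sketch: silhouette coefficient for k-means clusterings with different k.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Values near +1 indicate well-separated clusters; values near 0 indicate overlap.
    print(f"k={k}: silhouette coefficient = {silhouette_score(X, labels):.3f}")
```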
Limitations of the functional ML performance metrics
The functional ML performance metrics are limited to measuring the functionality of the model, e.g., in terms of accuracy, precision, recall, MSE, AUC, and the silhouette coefficient. They do not measure other non-functional quality attributes, such as those defined in ISO 25010 (e.g., performance efficiency) and those described in Chapter 2 (e.g., explainability, flexibility, and autonomy). In this syllabus, the term “ML functional performance metrics” is used because the term “performance metrics” is widely used for these functional metrics. The addition of “ML functional” emphasizes that these metrics are specific to machine learning and have no relationship to performance efficiency metrics.
The functional ML performance metrics are constrained by several other factors:
- In supervised learning, the functional ML performance metrics are computed based on labeled data, and the accuracy of the resulting metrics depends on correct labeling (see Section 4.5).
- The data used for measurement may not be representative (e.g., they may be biased), and the generated functional ML performance metrics depend on these data (see Section 2.4).
- The system may consist of multiple components, but the functional ML performance metrics apply only to the ML model. For example, the data pipeline is not considered in the evaluation of the model by the functional ML performance metrics.
- Most functional ML performance metrics can only be measured with tool support.
Selection of functional ML performance metrics
Typically, it is not possible to create an ML model that achieves the highest score for all functional ML performance metrics generated from a confusion matrix. Instead, the most appropriate functional ML performance metrics are selected as acceptance criteria based on the expected use of the model (e.g., to minimize false positives, a high precision score is required, while to minimize false negatives, the recall score should be high). The following criteria can be used in selecting the functional ML performance metrics described in Sections 5.1 and 5.2:
- Accuracy: This metric is likely to be applicable if the dataset is symmetric (i.e., the numbers of false positives and false negatives, and their associated costs, are similar). It is inappropriate when one class of data dominates the others; in that case, the F1 score should be used instead (as illustrated in the sketch after this list).
- Precision: This may be an appropriate metric when the cost of false positives is high and confidence in positive results must be high. A spam filter (where the classification of an email as spam is considered positive) is an example where high precision is required, as most users would not accept too many emails being moved to the spam folder that are not actually spam. If the classifier has to deal with situations where a very large percentage of cases are positive, using precision alone is probably not a good choice.
- Recall: When it is critical that positive results are not overlooked, a high recall value is important. For example, in cancer detection, it would be unacceptable to miss any true positives and mark them as negative (i.e., no cancer was detected).
- F1 score: The F1 score is most useful when there is an imbalance between expected classes and when precision and recall are of similar importance.
In addition to the above metrics, several metrics are described in Section 5.2. These may be applicable to specific ML problems, for example:
- AUC for ROC curve can be used for supervised classification problems.
- MSE and R-squared can be used for supervised regression problems.
- Inter-cluster metrics, intra-cluster metrics, and the silhouette coefficient can be used for unsupervised clustering problems.
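The following sketch (assuming scikit-learn, with synthetic labels) illustrates the point made above about accuracy on imbalanced data: a classifier that always predicts the dominant negative class achieves high accuracy but zero precision, recall, and F1 score.

```python
# Minimal sketch: why accuracy can be misleading when one class dominates.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 95 negatives and 5 positives; the naive model predicts "negative" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall   :", recall_score(y_true, y_pred))                      # 0.0
print("F1 score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```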
Benchmark Suites for ML
New AI technologies, such as new datasets, algorithms, models, and hardware, are released regularly, and it can be difficult to determine the relative performance of each new technology.
To enable objective comparisons between these different technologies, industry-standard ML benchmark suites exist. These cover a wide range of application areas and provide tools to evaluate hardware platforms, software frameworks, and cloud platforms for AI and ML performance.
ML benchmark suites can provide various metrics, including training times (e.g., how fast a framework can train an ML model with a defined training dataset to a certain target quality metric, such as 75% accuracy) and inference times (e.g., how fast a trained ML model can perform inference).
ML benchmark suites are offered by several organizations, e.g.:
- MLCommons: This is a nonprofit organization founded in 2020, previously called MLPerf, that provides benchmarks for software frameworks, AI-specific processors, and ML cloud platforms.
- DAWNBench: This is an ML benchmark suite from Stanford University.
- MLMark: This is an ML benchmark suite for measuring embedded inference performance and accuracy from the Embedded Microprocessor Benchmark Consortium.
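As an illustration of the "time to target quality" idea described above, the sketch below (assuming Python with scikit-learn) trains a simple classifier epoch by epoch and records the elapsed time until a target accuracy of 75% is reached. It is only a conceptual example, not the harness used by MLCommons, DAWNBench, or MLMark.

```python
# Minimal sketch: measure time to train a model to a target quality metric.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic dataset and target accuracy are illustrative choices.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

target_accuracy = 0.75
model = SGDClassifier(random_state=0)
classes = sorted(set(y_train))

start = time.perf_counter()
for epoch in range(1, 101):
    model.partial_fit(X_train, y_train, classes=classes)  # one pass over the data
    accuracy = accuracy_score(y_test, model.predict(X_test))
    if accuracy >= target_accuracy:
        break
elapsed = time.perf_counter() - start
print(f"Reached {accuracy:.3f} accuracy after {epoch} epoch(s) in {elapsed:.3f} s")
```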
Source: ISTQB®: Certified Tester AI Testing (CT-AI) Syllabus Version 1.0