Metrics
Metrics are used to measure the performance of machine-learning models.
Classification Metrics
Accuracy:
It is defined as the number of correct predictions divided by the total number of predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Balanced Accuracy:
Balanced accuracy is useful for imbalanced datasets. It is the average of the recall scores obtained on each class, i.e. the macro average of per-class recall. For a balanced dataset, it tends to be the same as accuracy.
Precision
When the class distribution is imbalanced (one class is much more frequent than the others), accuracy is not a good indicator of model performance. In that case, even predicting every sample as the most frequent class yields a high accuracy, which is misleading: the model might be overfitted to the most frequent class and might not generalize to the less frequent ones.
Therefore we also need to look at class-specific performance metrics. Precision is one such metric; it is defined as the number of correct predictions for a particular class divided by the number of samples predicted as that class.
Precision = True_Positive / (True_Positive + False_Positive)
Recall
Recall is another important metric, which is defined as the fraction of samples from a class that are correctly predicted by the model.
Recall = True_Positive / (True_Positive + False_Negative)
F1-Score
Depending on the business objective, you may want to give higher priority to recall or to precision. However, in many use cases both recall and precision are important, so we need a metric that combines them. One popular such metric is the F1-score, the harmonic mean of precision and recall:
F1-score = 2 * Precision * Recall / (Precision + Recall)
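As a minimal sketch, these classification metrics can be computed with scikit-learn's built-in functions (assuming scikit-learn is available; the labels below are invented for illustration):

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 precision_score, recall_score, f1_score)

    # Hypothetical ground-truth labels and model predictions for a binary problem
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print("Accuracy:         ", accuracy_score(y_true, y_pred))
    print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
    print("Precision:        ", precision_score(y_true, y_pred))
    print("Recall:           ", recall_score(y_true, y_pred))
    print("F1-score:         ", f1_score(y_true, y_pred))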
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
- True Positive Rate
- False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = True_Positive / (True_Positive + False_Negative)
False Positive Rate (FPR) is defined as follows:
FPR = False_Positive / (False_Positive + True_Negative)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
AUC
AUC represents the degree or measure of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.
An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability; in fact, it is inverting the results, predicting 0s as 1s and 1s as 0s. When the AUC is 0.5, the model has no class-separation capability at all.
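As a rough sketch, the ROC curve points and the AUC can be obtained from predicted probabilities with scikit-learn (the scores below are invented for illustration):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical true labels and predicted probabilities for the positive class
    y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55])

    # FPR and TPR at every threshold implied by the scores
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)

    print("FPR:", fpr)
    print("TPR:", tpr)
    print("AUC:", auc)

Plotting tpr against fpr gives the ROC curve; auc is the area under it.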
Regression Metrics:
Mean Absolute Error
Mean absolute error (or mean absolute deviation) measures the average absolute distance between the predicted and target values. MAE is defined as:
MAE = (1/n) * Σ |y_i − ŷ_i|
MAE is known to be more robust to outliers than MSE. The main reason is that by squaring the errors, MSE gives the outliers (which usually have larger errors than other samples) more weight in the final error, and therefore more influence on the model parameters.
Mean Squared Error
It is the most commonly used metric in regression problems. It is defined as the average squared error between the predicted and actual values:
MSE = (1/n) * Σ (y_i − ŷ_i)²
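A minimal NumPy sketch of both regression metrics (the values are invented for illustration):

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])

    # MAE: average absolute difference between predictions and targets
    mae = np.mean(np.abs(y_true - y_pred))      # 0.5
    # MSE: average squared difference, which weights outliers more heavily
    mse = np.mean((y_true - y_pred) ** 2)       # 0.375

    print("MAE:", mae)
    print("MSE:", mse)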
Metrics to measure the performance of Semantic Segmentation Models
Pixel Accuracy
Pixel accuracy is perhaps the easiest to understand conceptually. It is the percent of pixels in your image that are classified correctly.
However, high pixel accuracy does not always imply superior segmentation ability, because of the class-imbalance issue. When classes are extremely imbalanced, one class or a few classes dominate the image while the remaining classes make up only a small portion of it. Unfortunately, class imbalance is prevalent in many real-world datasets, so it cannot be ignored.
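Pixel accuracy itself is simple to compute; a small sketch with a hypothetical 2x3 label map:

    import numpy as np

    # Predicted and ground-truth class index per pixel (invented example)
    pred  = np.array([[0, 1, 1],
                      [2, 2, 0]])
    truth = np.array([[0, 1, 2],
                      [2, 2, 0]])

    pixel_accuracy = (pred == truth).mean()
    print(pixel_accuracy)   # 5 of 6 pixels correct -> ~0.833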
Intersection-Over-Union (IoU)
The Intersection-Over-Union (IoU), also known as the Jaccard Index, is one of the most commonly used metrics in semantic segmentation… and for good reason. The IoU is a very straightforward metric that’s extremely effective.
The IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between them. The metric ranges from 0 to 1 (0 to 100%), with 0 signifying no overlap and 1 signifying a perfectly overlapping segmentation.
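A minimal sketch of IoU for a single binary mask (the masks are invented; real pipelines typically compute IoU per class and average):

    import numpy as np

    def binary_iou(pred_mask, true_mask):
        # Overlap divided by union of the two binary masks
        intersection = np.logical_and(pred_mask, true_mask).sum()
        union = np.logical_or(pred_mask, true_mask).sum()
        return intersection / union if union > 0 else 1.0  # both empty: define as perfect

    pred  = np.array([[1, 1, 0],
                      [0, 1, 0]])
    truth = np.array([[1, 0, 0],
                      [0, 1, 1]])
    print(binary_iou(pred, truth))   # 2 / 4 = 0.5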
Dice Coefficient (F1-Score):
The Dice coefficient can be calculated from the precision and recall of a prediction; it scores the overlap between the predicted segmentation and the ground truth. It also penalizes false positives, which are common in highly class-imbalanced datasets such as medical image segmentation (MIS).
Based on the F-measure, two metrics are particularly popular in MIS:
- The Intersection-over-Union (IoU), also known as Jaccard index or Jaccard similarity coefficient
- The Dice similarity coefficient (DSC), also known as F1-score or Sørensen-Dice index: the most used metric in the large majority of scientific publications for MIS evaluation
The difference between the two metrics is that the IoU penalizes under- and over-segmentation more than DSC.
The Dice coefficient equals the F1-score: the harmonic mean of precision and recall. In other words, it is calculated as 2 * intersection divided by the total number of pixels in both images, i.e. Dice = 2 * |A ∩ B| / (|A| + |B|).
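A matching sketch of the Dice coefficient on the same invented masks as the IoU example above; note that the Dice score (about 0.67) is higher than the IoU (0.5), consistent with IoU penalizing segmentation errors more:

    import numpy as np

    def dice_coefficient(pred_mask, true_mask):
        # Dice = 2 * |intersection| / (|pred| + |truth|) for binary masks
        intersection = np.logical_and(pred_mask, true_mask).sum()
        total = pred_mask.sum() + true_mask.sum()
        return 2.0 * intersection / total if total > 0 else 1.0

    pred  = np.array([[1, 1, 0],
                      [0, 1, 0]])
    truth = np.array([[1, 0, 0],
                      [0, 1, 1]])
    print(dice_coefficient(pred, truth))   # 2*2 / (3+3) = ~0.667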
Metrics to measure the performance of Object Detection Models
Precision-Recall Curve:
The precision-recall curve is a good way to evaluate the performance of an object detector as the confidence threshold is varied; there is one curve per object class. An object detector of a particular class is considered good if its precision stays high as recall increases, which means that precision and recall both remain high as the confidence threshold changes. This can be understood from the equations for precision and recall above, keeping in mind that TP + FN equals the number of ground-truth objects and is therefore constant: if recall increases, TP has increased and FN has decreased, so precision stays high only if FP does not grow, i.e. the model makes few mistakes. In practice, precision-recall curves usually start with high precision values that decrease as recall increases. Average Precision, discussed next, summarizes this curve with a single number.
Average Precision (AP):
It is calculated as the area under the curve (AUC) of the precision-recall curve. Since precision-recall curves are often zigzag-shaped, comparing different curves (different detectors) in the same plot is usually not easy. In practice, AP is the precision averaged across all recall values between 0 and 1.
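One common way to compute AP is the all-point interpolation used in Pascal VOC-style evaluations; a sketch with an invented zigzag precision-recall curve (this is one of several AP definitions, not necessarily the one used by a given benchmark):

    import numpy as np

    def average_precision(recall, precision):
        # All-point interpolated AP: area under the precision-recall curve,
        # with recall sorted in increasing order.
        r = np.concatenate(([0.0], recall, [1.0]))
        p = np.concatenate(([0.0], precision, [0.0]))
        # Replace the zigzag curve with its monotonically decreasing envelope
        for i in range(len(p) - 2, -1, -1):
            p[i] = max(p[i], p[i + 1])
        # Sum rectangle areas wherever recall changes
        idx = np.where(r[1:] != r[:-1])[0]
        return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

    # Invented points of a zigzag precision-recall curve
    recall    = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
    precision = np.array([1.0, 0.8, 0.7, 0.5, 0.45])
    print(average_precision(recall, precision))   # ~0.51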
Mean Average Precision (mAP):
The mAP score is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the competition.
COCO mAP
For the COCO 2017 challenge, the mAP was calculated by averaging the AP over all 80 object categories AND all 10 IoU thresholds from 0.5 to 0.95 with a step size of 0.05. The authors hypothesize that averaging over IoUs rewards detectors with better localization.
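A small sketch of the averaging itself, with an invented AP table standing in for real per-class results:

    import numpy as np

    # Hypothetical AP table: one row per object class (80 for COCO),
    # one column per IoU threshold 0.50, 0.55, ..., 0.95
    iou_thresholds = np.arange(0.50, 1.00, 0.05)          # 10 thresholds
    ap = np.random.rand(80, len(iou_thresholds))          # stand-in for measured AP values

    map_at_50 = ap[:, 0].mean()    # mAP@0.50: average over classes at IoU 0.50 only
    coco_map  = ap.mean()          # COCO-style mAP: average over classes AND thresholds
    print(map_at_50, coco_map)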
Metrics to measure the performance of Image Generative Models (GANs, diffusion models)
Two important properties to consider:
- Fidelity: the quality of the generated images
- Diversity: the variety of the generated images
Inception Score
The Inception Score (IS) measures how realistic the outputs of a generative model are. It aims to measure two things in the generated output:
- Diversity: Images should have variety
- Sharpness: Each image should clearly depict some object
Only if both aims are satisfied will the score be high; otherwise it will be low. Mathematically, IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), i.e. the exponential of the expected KL divergence between the conditional label distribution p(y|x) and the marginal label distribution p(y).
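A minimal NumPy sketch of the Inception Score computed from classifier softmax outputs (the probabilities below are invented; in practice they come from Inception Net run on the generated images, often with the expectation estimated over several splits):

    import numpy as np

    def inception_score(probs, eps=1e-12):
        # probs: (N, num_classes) softmax outputs for N generated images
        # IS = exp( mean_x KL( p(y|x) || p(y) ) )
        p_y = probs.mean(axis=0, keepdims=True)   # marginal label distribution
        kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
        return np.exp(kl.mean())

    # Invented predictions for 4 generated images over 3 classes
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90],
                      [0.34, 0.33, 0.33]])
    print(inception_score(probs))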
FID — Frechet Inception Distance
The IS does not take the real data distribution into account. FID measures the similarity between the feature representations of generated samples and those of the test data.
Computing the FID follows this procedure:
- Let G denote the generated samples and T denote the test dataset
- Compute feature representations FG and FT for G and T respectively (e.g., the penultimate layer of Inception Net)
- Fit a multivariate Gaussian to each of FG and FT. Let (μG, ΣG) and (μT, ΣT) denote the means and covariances of the two Gaussians
- The FID is defined as the Wasserstein-2 distance between these two Gaussians, FID = ||μT − μG||² + Tr(ΣT + ΣG − 2(ΣT ΣG)^(1/2)); a numerical sketch follows this list
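A numerical sketch of the final step, assuming the feature statistics have already been estimated (SciPy's sqrtm is used for the matrix square root; the 2-D statistics below are invented stand-ins for real Inception features):

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(mu_t, sigma_t, mu_g, sigma_g):
        # Wasserstein-2 (Frechet) distance between the two fitted Gaussians
        diff = mu_t - mu_g
        covmean = sqrtm(sigma_t @ sigma_g)
        if np.iscomplexobj(covmean):      # sqrtm may return tiny imaginary parts
            covmean = covmean.real
        return diff @ diff + np.trace(sigma_t + sigma_g - 2.0 * covmean)

    # Invented 2-D feature statistics in place of (mu_T, Sigma_T) and (mu_G, Sigma_G)
    mu_t, sigma_t = np.zeros(2), np.eye(2)
    mu_g, sigma_g = np.ones(2) * 0.5, np.eye(2) * 1.2
    print(frechet_distance(mu_t, sigma_t, mu_g, sigma_g))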
CLIP-Score:
Text-image similarity: the CLIP score measures the visual-semantic alignment between the text descriptions and the generated (or manipulated) images by computing the cosine similarity between their embeddings, extracted with the CLIP text and image encoders.
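A minimal sketch of the scoring step, assuming the text and image embeddings have already been extracted with the CLIP encoders (the random vectors below are placeholders for real CLIP features):

    import numpy as np

    def clip_score(text_embedding, image_embedding):
        # Cosine similarity between a CLIP text embedding and a CLIP image embedding
        t = text_embedding / np.linalg.norm(text_embedding)
        v = image_embedding / np.linalg.norm(image_embedding)
        return float(np.dot(t, v))

    # Placeholder 512-dimensional embeddings standing in for real CLIP features
    rng = np.random.default_rng(0)
    text_emb, image_emb = rng.normal(size=512), rng.normal(size=512)
    print(clip_score(text_emb, image_emb))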