Tom Kwok

IoU vs F1 score


Background #

Intersection over Union (IoU) and F1 score are commonly used evaluation metrics for binary classification tasks such as object detection and image segmentation.

Denote the number of true positives by TP, the number of false positives by FP and the number of false negatives by FN. Note that the term for the number of true negatives TN is not present in the expression for either IoU or F1 score, which implies that neither IoU nor F1 is symmetric in the positive and negative class.
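As a minimal sketch of the standard definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), the two metrics can be computed directly from the three counts. The function names and the example counts below are illustrative assumptions, and the functions assume at least one of the counts is non-zero.

```python
def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over Union (Jaccard index): TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score (Dice coefficient): 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for a single prediction; TN never enters either formula.
tp, fp, fn = 60, 20, 20
print(iou(tp, fp, fn))  # 0.6
print(f1(tp, fp, fn))   # 0.75
```

Because TN is not an argument of either function, adding any number of true negatives leaves both scores unchanged.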

Summary #

In the following we show that IoU and F1 score (both of which can be expressed in terms of TP, FP and FN) can be re-expressed in terms of precision and recall.

Contour plots are provided to build intuition. As per convention, variable x is on the horizontal axis, variable y is on the vertical axis, and the contour lines are slices of constant z.
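A rough sketch of how such a contour plot could be produced is shown below. It is not the code used for the figures in this post: the helper names `contour_grid` and `plot_contours`, the grid resolution and the contour levels are all assumptions, and the drawing step assumes matplotlib is installed.

```python
from typing import Callable, List, Tuple

Grid = Tuple[List[float], List[float], List[List[float]]]


def contour_grid(metric: Callable[[float, float], float], n: int = 101) -> Grid:
    """Evaluate metric(x, y) on an n-by-n grid over (0, 1] x (0, 1]."""
    # Start at 1/n rather than 0 so metrics that divide by x or y stay defined.
    ticks = [(i + 1) / n for i in range(n)]
    # Row index follows y (vertical axis), column index follows x (horizontal axis).
    z = [[metric(x, y) for x in ticks] for y in ticks]
    return ticks, ticks, z


def plot_contours(metric: Callable[[float, float], float], title: str) -> None:
    """Draw labelled contour lines (z slices) of metric(x, y)."""
    import matplotlib.pyplot as plt  # imported lazily: only needed for drawing

    xs, ys, z = contour_grid(metric)
    fig, ax = plt.subplots()
    lines = ax.contour(xs, ys, z, levels=[i / 10 for i in range(1, 10)])
    ax.clabel(lines)  # print the z value on each contour line
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title(title)
    ax.set_aspect("equal")
    plt.show()
```

For example, `plot_contours(min, "z = min(x, y)")` would draw the L-shaped contours of the minimum function.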

For easy comparison, the contour plots for (i) IoU, (ii) F1 and (iii) harmonic mean of IoU and F1 are combined to produce the following animated sequence.

Contour plots for IoU, F1 and harmonic mean of IoU and F1 in animated sequence

The harmonic mean of IoU and F1 is not to be confused with Fβ scores. For comparison, an animated sequence of contour plots for Fβ scores with different values of β is also provided.

Contour plots for Fβ scores with different values of β in animated sequence


IoU #

Intersection over Union is also known as the Jaccard index, which is generalized by the Tversky index. It can be re-expressed in terms of precision and recall.

Derivation of IoU in terms of precision and recall
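A sketch of this derivation, using the standard definitions of precision and recall (the shorthand symbols P and R are introduced here for brevity):

```latex
\begin{align*}
P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \\
\frac{1}{\mathrm{IoU}} &= \frac{TP + FP + FN}{TP}
  = \frac{TP + FP}{TP} + \frac{TP + FN}{TP} - 1
  = \frac{1}{P} + \frac{1}{R} - 1 \\
\mathrm{IoU} &= \frac{PR}{P + R - PR}
\end{align*}
```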

We can observe that IoU score measures something closer to the worst case, i.e. the minimum, of precision and recall.
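One way to see why: since 1/IoU = 1/P + (1/R − 1) and 1/R − 1 ≥ 0, IoU can never exceed P, and by symmetry it can never exceed R, so IoU ≤ min(P, R). The sketch below checks this numerically; the function name and the sample (precision, recall) pairs are illustrative assumptions.

```python
def iou_from_pr(p: float, r: float) -> float:
    """IoU from precision p and recall r, both in (0, 1]."""
    return p * r / (p + r - p * r)


# Illustrative (precision, recall) pairs, not taken from any dataset.
for p, r in [(0.9, 0.3), (0.5, 0.5), (1.0, 0.4)]:
    print(p, r, round(iou_from_pr(p, r), 3), min(p, r))
```

IoU sits just below the minimum when precision and recall are far apart, and equals the minimum exactly when the other value is 1.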

Contour plot of Intersection over Union (IoU) as a function of precision and recall

Contour plot of z = min(x, y)


F1 score #

F1 score is also known as the Dice coefficient. It is by definition the harmonic mean of precision and recall.

Derivation of F1 in terms of precision and recall
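A sketch of this derivation, starting from F1 = 2TP / (2TP + FP + FN) and the standard definitions of precision and recall (P and R are shorthand introduced here):

```latex
\begin{align*}
P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \\
\frac{2}{F_1} &= \frac{2\,TP + FP + FN}{TP}
  = \frac{TP + FP}{TP} + \frac{TP + FN}{TP}
  = \frac{1}{P} + \frac{1}{R} \\
F_1 &= \frac{2PR}{P + R}
\end{align*}
```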

We can observe that F1 score measures something closer to the average of precision and recall. This is apparent especially for precision values and recall values that are both greater than around 0.5.
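The gap between the arithmetic mean and F1 works out to (P − R)² / (2(P + R)), which shrinks as P + R grows for a fixed difference between P and R. This is consistent with the observation above about values greater than around 0.5. A small numerical sketch, with illustrative function names and sample values:

```python
def f1_from_pr(p: float, r: float) -> float:
    """F1 as the harmonic mean of precision p and recall r, both in (0, 1]."""
    return 2 * p * r / (p + r)


def gap_to_mean(p: float, r: float) -> float:
    """How far F1 falls below the arithmetic mean of p and r."""
    return (p + r) / 2 - f1_from_pr(p, r)


# Same difference between p and r (0.2), at high and at low values.
print(round(gap_to_mean(0.9, 0.7), 4))
print(round(gap_to_mean(0.3, 0.1), 4))
```

The first gap is a quarter of the second, because P + R is four times larger while P − R is the same.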

Contour plot of F1 score as a function of precision and recall

Contour plot of z = (x + y) / 2


Fβ score #

F1 score is generalized by the Fβ score, which measures something close to a weighted average of precision and recall, in which a change in recall has β times the effect of a change in precision. The proof for why β² rather than β is used in the formula for Fβ can be found here.

Derivation of F beta in terms of precision and recall
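A sketch of this derivation, starting from the standard count-based form of Fβ (P and R are shorthand for precision and recall, introduced here):

```latex
\begin{align*}
F_\beta &= \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP} \\
\frac{1 + \beta^2}{F_\beta} &= \frac{TP + FP}{TP} + \beta^2\,\frac{TP + FN}{TP}
  = \frac{1}{P} + \frac{\beta^2}{R} \\
F_\beta &= \frac{(1 + \beta^2)\,PR}{\beta^2 P + R}
\end{align*}
```

Setting β = 1 recovers F1 = 2PR / (P + R).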

Plots are generated for Fβ with β = 0.5 and with β = 2, together with linear plots of their arithmetic mean analogues. We can observe that the plots are asymmetric in precision and recall.
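The asymmetry can be checked numerically with the standard formula Fβ = (1 + β²)PR / (β²P + R). The function name and sample values below are illustrative assumptions.

```python
def fbeta_from_pr(p: float, r: float, beta: float) -> float:
    """F-beta from precision p and recall r, both in (0, 1], with beta > 0."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Swapping precision and recall changes the score unless beta == 1.
print(round(fbeta_from_pr(0.9, 0.3, 0.5), 4))  # high precision, low recall
print(round(fbeta_from_pr(0.3, 0.9, 0.5), 4))  # low precision, high recall
```

With β = 0.5 the first case scores higher, since precision carries more weight. The plots for β = 0.5 and β = 2 mirror each other, because F2 at (P, R) equals F0.5 at (R, P).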

Contour plot of F0.5 score as a function of precision and recall

Contour plot of z = (x + 0.5y) / 1.5

Contour plot of F2 score as a function of precision and recall

Contour plot of z = (x + 2y) / 3


New metric #

In the following we shift our focus back to the metrics that are symmetric in precision and recall. By expressing IoU and F1 in terms of TP, FP and FN, we observe that we can take the harmonic mean of IoU and F1 to devise a new metric that can be intuitively understood as something in between IoU and F1.

IoU, F1 and the new metric in terms of TP, FP and FN
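Written side by side, the three expressions differ only in the weight given to the errors FP + FN. The sketch below follows from the standard definitions; the symbol M for the new metric is a placeholder introduced here, not a name used in the post.

```latex
\begin{align*}
\mathrm{IoU} &= \frac{TP}{TP + (FP + FN)} \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{TP}{TP + \tfrac{1}{2}(FP + FN)} \\
\frac{2}{M} &= \frac{1}{\mathrm{IoU}} + \frac{1}{F_1}
  = \frac{2\,TP + \tfrac{3}{2}(FP + FN)}{TP} \\
M &= \frac{TP}{TP + \tfrac{3}{4}(FP + FN)}
\end{align*}
```

The error weight of 3/4 lies midway between the weight of 1 in IoU and the weight of 1/2 in F1.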

The new metric can also be re-expressed in terms of precision and recall to generate the following contour plot. We can observe that the new metric, defined as the harmonic mean of IoU and F1, measures something close to the average of (i) the worst case of precision and recall and (ii) the average of precision and recall.
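Substituting FP/TP = 1/P − 1 and FN/TP = 1/R − 1 into the harmonic mean of IoU and F1 gives the closed form 4PR / (3(P + R) − 2PR). The sketch below checks that form against the harmonic mean computed directly; the function names and the sample point are illustrative assumptions.

```python
def iou_from_pr(p: float, r: float) -> float:
    """IoU from precision p and recall r, both in (0, 1]."""
    return p * r / (p + r - p * r)


def f1_from_pr(p: float, r: float) -> float:
    """F1 from precision p and recall r."""
    return 2 * p * r / (p + r)


def new_metric_from_pr(p: float, r: float) -> float:
    """Closed form of the harmonic mean of IoU and F1."""
    return 4 * p * r / (3 * (p + r) - 2 * p * r)


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


p, r = 0.9, 0.3  # hypothetical precision and recall
direct = harmonic_mean(iou_from_pr(p, r), f1_from_pr(p, r))
print(round(direct, 4), round(new_metric_from_pr(p, r), 4))
```

The two printed values agree, and the result always lies between IoU and F1.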

Contour plot of the proposed new metric as a function of precision and recall

Contour plot of z = (min(x, y) + ((x + y) / 2)) / 2


Inspiration #

This post is inspired by an answer in a StackExchange post:

"For any fixed ground truth, the two metrics are always positively correlated." ... "F score tends to measure something closer to average performance, while the IoU score measures something closer to the worst case performance" ... "over a set of inferences."


Further work #

A new thesis pre-proposal in PDF format has been prepared with the gradients of and Fβ loss and proposed Gα loss.

