 # IoU vs F1 score

### Background #

Intersection over Union (IoU) and F1 score are commonly used evaluation metrics for binary classification, including tasks such as object detection and image segmentation.

Denote the number of true positives by TP, the number of false positives by FP and the number of false negatives by FN. Note that the term for the number of true negatives TN is not present in the expression for either IoU or F1 score, which implies that neither IoU nor F1 is symmetric in the positive and negative class.
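
For reference, the standard definitions of the two metrics in terms of these counts are:

$$
\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}
$$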

### Summary #

In the following we show that IoU and F1 score (both of which can be expressed in terms of TP, FP and FN) can be re-expressed in terms of:

• precision (which can be expressed in terms of TP and FP); and
• recall (which can be expressed in terms of TP and FN).
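
The re-expression can be checked numerically. The following is a minimal sketch rather than the code used for the plots: the function names and the example counts are illustrative assumptions, and it assumes TP > 0 so that no denominator is zero.

```python
import math


def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn)


def iou_from_counts(tp, fp, fn):
    return tp / (tp + fp + fn)


def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def iou_from_pr(p, r):
    """IoU written purely in terms of precision p and recall r."""
    return p * r / (p + r - p * r)


def f1_from_pr(p, r):
    """F1 as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 60, 20, 40
p = precision(tp, fp)
r = recall(tp, fn)

# Both forms of each metric agree up to floating-point error.
assert math.isclose(iou_from_pr(p, r), iou_from_counts(tp, fp, fn))
assert math.isclose(f1_from_pr(p, r), f1_from_counts(tp, fp, fn))
```

Working from precision and recall alone is what makes a two-dimensional contour plot possible: the three counts collapse into two ratios.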

Contour plots are provided to build intuition. By convention, variable x is on the horizontal axis, variable y is on the vertical axis, and the contour lines are slices of constant z.
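
A contour plot of this kind could be produced roughly as follows. This is a hedged sketch, not the tooling used for the plots in this post: it assumes x is precision, y is recall and z is the metric value, the helper names are hypothetical, and plotting relies on the third-party matplotlib package, imported only when a plot is requested.

```python
def iou(p, r):
    """IoU in terms of precision p and recall r, both in (0, 1]."""
    return p * r / (p + r - p * r)


def make_grid(metric, n=50):
    """Evaluate metric(x, y) on an n-by-n grid over (0, 1] x (0, 1]."""
    # Start at 1/n rather than 0 to avoid a 0/0 at the origin.
    ticks = [(i + 1) / n for i in range(n)]
    # Rows follow y (vertical axis), columns follow x (horizontal axis).
    z = [[metric(x, y) for x in ticks] for y in ticks]
    return ticks, ticks, z


def plot_contours(metric, title, n=50):
    """Draw contour lines at z = 0.1, 0.2, ..., 0.9 and return the figure."""
    import matplotlib.pyplot as plt  # third-party; only needed for plotting

    xs, ys, z = make_grid(metric, n)
    fig, ax = plt.subplots()
    contours = ax.contour(xs, ys, z, levels=[0.1 * k for k in range(1, 10)])
    ax.clabel(contours)  # label each contour line with its z value
    ax.set_xlabel("precision")  # assumed axis assignment
    ax.set_ylabel("recall")  # assumed axis assignment
    ax.set_title(title)
    ax.set_aspect("equal")
    return fig
```

Calling `plot_contours(iou, "IoU")` returns a matplotlib figure. Any other function of precision and recall can be passed in place of `iou`.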

For easy comparison, the contour plots for (i) IoU, (ii) F1 and (iii) the harmonic mean of IoU and F1 are combined to produce the following animated sequence. This is not to be confused with Fβ scores, for which a separate animated sequence of plots with different values of β is produced for comparison.

### IoU #
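
Writing P for precision, TP / (TP + FP), and R for recall, TP / (TP + FN) (the symbols P and R are notation assumed here), one way to express IoU in terms of the two is to take the reciprocal, assuming TP > 0:

$$
\frac{1}{\mathrm{IoU}} = \frac{TP + FP + FN}{TP} = \frac{TP + FP}{TP} + \frac{TP + FN}{TP} - 1 = \frac{1}{P} + \frac{1}{R} - 1
\quad\Longrightarrow\quad
\mathrm{IoU} = \frac{P R}{P + R - P R}
$$

Because $1/R - 1 \ge 0$, this gives $1/\mathrm{IoU} \ge 1/P$, and by symmetry $1/\mathrm{IoU} \ge 1/R$, so $\mathrm{IoU} \le \min(P, R)$: IoU never exceeds the smaller of precision and recall.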

Intersection over Union is also known as the Jaccard index, which is generalized by the Tversky index. It can be re-expressed in terms of precision and recall. We can observe that the IoU score measures something closer to the worst case, i.e. the minimum, of precision and recall.

### F1 score #
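
Written out with P for precision and R for recall (notation assumed here), the harmonic-mean definition and its equivalent form in terms of counts are:

$$
F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2 P R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
$$

A harmonic mean always lies between the minimum and the arithmetic mean of its arguments:

$$
\min(P, R) \;\le\; F_1 \;\le\; \frac{P + R}{2}
$$

For example, with P = 0.9 and R = 0.6, F1 = 0.72, which is much nearer the arithmetic mean of 0.75 than the minimum of 0.6.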

F1 score is also known as the Dice coefficient. It is by definition the harmonic mean of precision and recall. We can observe that F1 score measures something closer to the average of precision and recall. This is especially apparent when precision and recall are both greater than about 0.5.

### Fβ score #
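
For reference, the standard Fβ formula in terms of precision P and recall R is Fβ = (1 + β²)·P·R / (β²·P + R), a weighted harmonic mean with weight 1 on precision and β² on recall. The sketch below compares it with a weighted arithmetic mean using the same weights. The function names are illustrative, and reading the "arithmetic mean" counterpart as this particular weighted mean is an assumption.

```python
import math


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r (weights 1 and beta^2)."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def weighted_arithmetic_mean(p, r, beta=1.0):
    """Arithmetic counterpart with the same weights (assumed analogue)."""
    b2 = beta * beta
    return (p + b2 * r) / (1 + b2)


# Hypothetical operating point: low precision, perfect recall.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, f_beta(p, r, beta), weighted_arithmetic_mean(p, r, beta))

# beta = 1 recovers the ordinary F1 score.
assert math.isclose(f_beta(p, r, 1.0), 2 * p * r / (p + r))

# For beta != 1 the score is not symmetric in precision and recall.
assert not math.isclose(f_beta(p, r, 2.0), f_beta(r, p, 2.0))
```

With β = 2 the score moves toward recall, and with β = 0.5 toward precision, which is why swapping the two arguments changes the result.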

F1 score is generalized by Fβ score, which measures something close to the weighted average of precision and recall, where recall is considered β times as important as precision. The proof for why β² instead of β is used in the formula for Fβ can be found here. Plots are generated for Fβ with β = 0.5 and β = 2, together with linear plots for their arithmetic mean analogues. We can observe that the plots are asymmetric.

### New metric #
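
The metric discussed in this section is the harmonic mean of IoU and F1. Since 1/IoU = (TP + FP + FN) / TP and 1/F1 = (2·TP + FP + FN) / (2·TP), the harmonic mean works out to 4·TP / (4·TP + 3·FP + 3·FN), or equivalently 4·P·R / (3·(P + R) − 2·P·R) in terms of precision P and recall R. These closed forms are derived here rather than copied from the original figures, and the sketch below checks them numerically. The function names and example counts are hypothetical.

```python
import math


def harmonic_mean(a, b):
    return 2 * a * b / (a + b)


def new_metric_from_counts(tp, fp, fn):
    """Harmonic mean of IoU and F1, simplified in terms of the counts."""
    return 4 * tp / (4 * tp + 3 * (fp + fn))


def new_metric_from_pr(p, r):
    """The same quantity in terms of precision p and recall r."""
    return 4 * p * r / (3 * (p + r) - 2 * p * r)


# Hypothetical counts, chosen only for illustration (requires tp > 0).
tp, fp, fn = 60, 20, 40
iou = tp / (tp + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
p = tp / (tp + fp)
r = tp / (tp + fn)

combined = harmonic_mean(iou, f1)

# All three routes give the same value, which lies between IoU and F1.
assert math.isclose(combined, new_metric_from_counts(tp, fp, fn))
assert math.isclose(combined, new_metric_from_pr(p, r))
assert iou <= combined <= f1
```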

In the following we shift our focus back to metrics that are symmetric in precision and recall. By expressing IoU and F1 in terms of TP, FP and FN, we observe that we can take the harmonic mean of IoU and F1 to devise a new metric that can be intuitively understood as something in between IoU and F1. The new metric can also be re-expressed in terms of precision and recall, which is used to generate the following contour plot. We can observe that the new metric measures something close to the average of (i) the worst case of precision and recall and (ii) the average of precision and recall.

### Inspiration #

This post is inspired by an answer in a StackExchange post:

"For any fixed ground truth, the two metrics are always positively correlated." ... "F score tends to measure something closer to average performance, while the IoU score measures something closer to the worst case performance" ... "over a set of inferences."

### Further work #

A new thesis pre-proposal in PDF format has been prepared with the gradients of the Fβ loss and the proposed Gα loss.

Changelog

• Feb 2021 Replaced plots with new versions generated with my new custom rainbow color map and re-exported in SVG format using my new tool svgasm.
• Feb 2021 Added a sequence of plots for Fβ scores in animated format.
• Jan 2021 Added a sequence of plots in animated format in summary section.
• Jan 2021 Updated all LaTeX graphics to use sans serif typeface instead of the default serif typeface with `sansmath`. This improves readability on screen and reduces total file size of the graphics by 17%. 