Accuracy metrics: the binary vs multiclass case

One thing I didn’t make sufficiently clear (and which our in-class multiclass XP example probably didn’t help with) is how metrics are treated differently for binary classification vs. multiclass classification.

Here’s the deal. Whenever you perform a classification task, you have one of the following two scenarios:

Binary. You have only one “thing” you’re trying to detect. Example: you’re detecting “politically polarized texts.” (Everything else is a “not-politically-polarized text.”)
Multiclass. You have multiple “things” you’re trying to detect. Example: you’re detecting whether a Federalist Paper was authored by Hamilton, Madison, or Jay.

In the binary case, one normally designates one of the two options as the “primary option” (for instance, “politically-polarized”) and computes precision, recall, and F1-score based on only that primary option. One does not normally compute precision/recall/F1-score for “politically-polarized,” compute them again for “not-politically-polarized,” and then use micro- or macro-averaging.
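To make the binary case concrete, here is a minimal sketch in plain Python. The function name binary_prf, the label strings, and the eight toy predictions are all made up for illustration; they are not data from the quiz or from lecture. The point is simply that only the primary option is counted as “positive,” and nothing gets averaged.

```python
def binary_prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for the single designated primary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical toy data: 3 truly polarized texts, 5 that are not.
y_true = ["polarized", "polarized", "polarized", "not", "not", "not", "not", "not"]
y_pred = ["polarized", "polarized", "not", "polarized", "polarized", "not", "not", "not"]

# Only the primary option ("polarized") is scored; no averaging step.
precision, recall, f1 = binary_prf(y_true, y_pred, positive="polarized")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With this toy data there are 2 true positives, 2 false positives, and 1 false negative, so precision is 2/4 and recall is 2/3.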

The only time you need to (and should) use micro- or macro-averaging is in the multiclass case, when you’re sorting everything into more than two labels. Then, the only real way to take into account “how well do I do in identifying Hamilton? Madison? Jay?” is to compute three separate precision/recall/F1-scores and average them.
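Here is a comparable sketch for the multiclass case, again in plain Python with made-up predictions (the helper names counts_for and prf are hypothetical, and the eight “papers” are toy data, not real attribution results). Macro-averaging computes one score per author and takes the plain mean; micro-averaging, as it is usually defined, pools the true-positive, false-positive, and false-negative counts across authors first and then computes a single score.

```python
def counts_for(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


labels = ["Hamilton", "Madison", "Jay"]

# Hypothetical toy data: true authors vs. a classifier's guesses.
y_true = ["Hamilton", "Hamilton", "Hamilton", "Hamilton",
          "Madison", "Madison", "Madison", "Jay"]
y_pred = ["Hamilton", "Hamilton", "Hamilton", "Madison",
          "Madison", "Madison", "Hamilton", "Jay"]

per_label_counts = {lab: counts_for(y_true, y_pred, lab) for lab in labels}

# Macro: one F1 per author, then the unweighted mean of those F1 scores.
per_label_f1 = {lab: prf(*c)[2] for lab, c in per_label_counts.items()}
macro_f1 = sum(per_label_f1.values()) / len(labels)

# Micro: sum the counts over all authors, then compute F1 once.
total_tp = sum(c[0] for c in per_label_counts.values())
total_fp = sum(c[1] for c in per_label_counts.values())
total_fn = sum(c[2] for c in per_label_counts.values())
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

print(per_label_f1)
print(f"macro F1={macro_f1:.3f}  micro F1={micro_f1:.3f}")
```

Notice that the macro score gives Jay (one paper in this toy set) the same weight as Hamilton (four papers), while the micro score is dominated by whichever authors have the most papers.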

It’s quite possible that I didn’t make this sufficiently clear, and that the fact that we did a multiclass example in lecture reinforced the idea that you always needed to compute separate metrics and average them, even in the binary case.

All this to say: if on Quiz #3, which had a binary classification example (“passive-aggressive” or not), you used the multiclass technique of computing scores for “passive-aggressive” and “non-passive-aggressive” separately and then averaging them, I will forgive this venial sin and give you your points back. If this is the case, please send me an email with the number of XP you missed for that reason and I’ll post it on the scoreboard.

