Different averaging methods in sklearn.metrics.classification_report()

Posted by John on 2020-04-27
Words 800 and Reading Time 3 Minutes

Preface

When evaluating a model, we usually want to look at metrics beyond accuracy.

sklearn provides a convenient function that returns several quantitative evaluation metrics at once, for example:


from sklearn.metrics import classification_report
print(classification_report(test_y_true, test_y_pred, digits=3))

Running this on a test set produces a small table of per-class metrics plus three summary rows.

How should we read that table? macro and weighted are clearly different averaging methods, so the last two rows must average each metric in different ways. But then why does accuracy sit on the third-to-last row, and why does it have no precision or recall?

  • The short answer: the accuracy row is actually the micro avg
  • Under micro averaging, precision, recall, and f1-score are all the same

The details are explained below.
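As a self-contained sketch of the call above (the labels here are made up, since the post's own data isn't shown):

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels for a 3-class problem (not the post's original data)
test_y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
test_y_pred = [0, 1, 0, 1, 1, 2, 1, 2, 2, 0]

report = classification_report(test_y_true, test_y_pred, digits=3)
print(report)
```

The printed report has one row per class, then an `accuracy` row, then `macro avg` and `weighted avg` rows.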

F1-score

In general, we prefer classifiers with higher precision and recall scores. However, there is a trade-off between precision and recall: when tuning a classifier, improving the precision score often results in lowering the recall score and vice versa — there is no free lunch.

There is usually a trade-off between precision and recall: pushing one up tends to pull the other down.

  • no free lunch theorem

Want a single metric that captures both precision and recall? The F1-score uses the harmonic mean:

F1-score = 2 × (precision × recall)/(precision + recall)

The harmonic mean gives greater weight to the smaller value, as the following example shows:

Like the arithmetic mean, the F1-score will always be somewhere in between precision and recall. But it behaves differently: the F1-score gives a larger weight to lower numbers. For example, when Precision is 100% and Recall is 0%, the F1-score will be 0%, not 50%. Or for example, say that Classifier A has precision=recall=80%, and Classifier B has precision=60%, recall=100%. Arithmetically, the mean of the precision and recall is the same for both models. But when we use F1’s harmonic mean formula, the score for Classifier A will be 80%, and for Classifier B it will be only 75%. Model B’s low precision score pulled down its F1-score.
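The arithmetic above is easy to check in Python; the `f1` helper below is just the harmonic-mean formula from the text:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the F1-score formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A: precision = recall = 80%  ->  F1 = 0.80
# Classifier B: precision = 60%, recall = 100%  ->  F1 = 0.75
print(f1(0.80, 0.80))  # 0.8
print(f1(0.60, 1.00))  # 0.75
print(f1(1.00, 0.00))  # 0.0, not 0.5: the lower value dominates
```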

F1-score in multi-class

In a multi-class task, each class gets its own f1-score, so how do we obtain an overall f1-score? There are several averaging methods. Suppose we have three classes with the following per-class F1-scores:

  • 42.1%
  • 30.8%
  • 66.7%

macro

The arithmetic mean of all the f1-scores, i.e. add them up and divide by the number of classes:

Macro-F1 = (42.1% + 30.8% + 66.7%) / 3 = 46.5%

macro-f1 is unweighted: every class gets the same weight.

  • That is, under an imbalanced distribution, a small class influences the overall score just as much as a large one.
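The macro average above can be reproduced in a couple of lines:

```python
# Per-class F1-scores from the text
per_class_f1 = [0.421, 0.308, 0.667]

# Macro-F1: plain arithmetic mean, one equal vote per class
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.465
```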

weighted

Suppose we have 25 samples, with class counts of 6, 10, and 9.

Then Weighted-F1 is computed as:

Weighted-F1 = (6 × 42.1% + 10 × 30.8% + 9 × 66.7%) / 25 = 46.4%

Each class's f1 is weighted by the number of samples in that class.

  • That is, under an imbalanced distribution, the large classes will dominate the overall f1.
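The weighted average above, reproduced in Python:

```python
# Per-class F1-scores and class sizes (support) from the text
per_class_f1 = [0.421, 0.308, 0.667]
support = [6, 10, 9]  # 25 samples in total

# Weighted-F1: each class's F1 weighted by its sample count
weighted_f1 = sum(f * n for f, n in zip(per_class_f1, support)) / sum(support)
print(round(weighted_f1, 3))  # 0.464
```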

micro

First compute the overall precision and recall across all classes, then compute micro-f1 from them:

precision = TP/(TP+FP)

recall = TP/(TP+FN)

Consider the confusion matrix in the multi-class setting:

  • TP is the numbers on the diagonal (correctly classified samples)
  • FP/FN are the off-diagonal numbers; entry (A, B) of the confusion matrix, i.e. a sample whose true class is A but was predicted as B, counts as:
    • a false negative (FN) for class A
    • a false positive (FP) for class B

So under micro averaging, precision and recall are identical: every off-diagonal entry is simultaneously an FP for one class and an FN for another, so the total FP equals the total FN. In fact, accuracy is identical too.

  • After all, accuracy is just the number of correctly classified samples (the diagonal) divided by the total count in the confusion matrix, which is exactly the micro precision/recall.
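This can be verified with sklearn's metric functions; the labels here are hypothetical, but any single-label multi-class data would show the same coincidence:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical toy labels for a 3-class problem
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 1, 2, 2, 0]

micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

# In single-label multi-class classification all four coincide
print(micro_p, micro_r, micro_f, acc)
```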
