Preface
When evaluating how good a model is, we usually want to look at other evaluation metrics besides accuracy.
sklearn provides a very convenient function for quickly getting a set of quantitative metrics for a model, as in the code below:
from sklearn.metrics import classification_report
which gives a report like the one shown below:
How exactly should we read this report? We can tell that macro and weighted are different averaging methods, so the last two rows must be averaging each metric in different ways. But then why does accuracy sit on the third-to-last row, with no precision or recall of its own?
- Spoiler: the accuracy row is actually doing a micro avg
- under micro averaging, precision, recall, and f1-score are all identical
The details are explained below.
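As a minimal sketch of how the function is typically called (the `y_true` and `y_pred` labels here are made up for illustration):

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth and predicted labels for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 2, 2, 1, 1, 0, 1, 2, 2, 1]

# Prints per-class precision/recall/f1-score/support, plus the
# accuracy, macro avg, and weighted avg rows discussed above
print(classification_report(y_true, y_pred))
```

Passing `output_dict=True` instead returns the same numbers as a dictionary, which is handy for logging.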
F1-score
In general, we prefer classifiers with higher precision and recall scores. However, there is a trade-off between precision and recall: when tuning a classifier, improving the precision score often results in lowering the recall score and vice versa — there is no free lunch.
There is usually a trade-off between precision and recall: improving one tends to lower the other
- no free lunch theorem
Want a single metric that measures precision and recall at the same time? The F1-score uses the harmonic mean:
F1-score = 2 × (precision × recall)/(precision + recall)
The harmonic mean gives more weight to the smaller value, as the following example shows:
Similar to the arithmetic mean, the F1-score will always be somewhere in between precision and recall. But it behaves differently: the F1-score gives a larger weight to lower numbers. For example, when precision is 100% and recall is 0%, the F1-score will be 0%, not 50%. Or say that Classifier A has precision = recall = 80%, and Classifier B has precision = 60%, recall = 100%. Arithmetically, the mean of precision and recall is the same for both models. But when we use F1's harmonic mean formula, the score for Classifier A will be 80%, while for Classifier B it will be only 75%. Classifier B's low precision score pulled down its F1-score.
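The Classifier A vs. Classifier B comparison above can be checked with a few lines of arithmetic:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A: precision = recall = 80%  -> F1 = 80%
# Classifier B: precision = 60%, recall = 100% -> F1 = 75%
# Degenerate case: precision = 100%, recall = 0% -> F1 = 0%, not 50%
print(f1(0.80, 0.80))
print(f1(0.60, 1.00))
print(f1(1.00, 0.00))
```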
F1-score in multi-class
In a multi-class task, each class gets its own f1-score. How do we turn them into a single overall f1-score? There are several averaging methods. Suppose we have three classes whose F1-scores are:
- 42.1%
- 30.8%
- 66.7%
macro
The arithmetic mean of the per-class f1-scores, i.e. their sum divided by the number of classes:
Macro-F1 = (42.1% + 30.8% + 66.7%) / 3 = 46.5%
macro-f1 is unweighted: every class carries the same weight
- in other words, under an imbalanced distribution, a small class influences the overall score just as much as a large one
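The same arithmetic as a short snippet, using the per-class F1-scores from the example:

```python
# Per-class F1-scores from the example above
f1_scores = [0.421, 0.308, 0.667]

# Macro-F1: plain arithmetic mean, every class weighted equally
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"{macro_f1:.1%}")  # 46.5%
```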
weighted
Suppose we have 25 samples, with class counts of 6, 10, and 9.
Then Weighted-F1 is computed as:
Weighted-F1 = (6 × 42.1% + 10 × 30.8% + 9 × 66.7%) / 25 = 46.4%
Each class's f1 is weighted by that class's sample count
- in other words, under an imbalanced distribution, the large classes dominate the overall f1
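And the weighted version, using the same F1-scores together with the class counts above:

```python
# Per-class F1-scores and per-class sample counts (support) from the example
f1_scores = [0.421, 0.308, 0.667]
supports = [6, 10, 9]  # 25 samples in total

# Weighted-F1: each class's F1 weighted by its number of samples
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)
print(f"{weighted_f1:.1%}")  # 46.4%
```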
micro
First compute the overall precision and recall across all classes, then compute micro-f1 from them
precision = TP/(TP+FP)
recall = TP/(TP+FN)
Consider the confusion matrix in the multi-class case:
- TP is the sum of the diagonal entries (samples classified correctly)
- FP and FN come from the off-diagonal entries. A count in cell (A, B) of the confusion matrix, i.e. a sample of true class A predicted as class B, can be read two ways:
    - from the point of view of class A (the true class), it is a false negative (FN)
    - from the point of view of class B (the predicted class), it is a false positive (FP)
Every misclassified sample therefore contributes exactly one FP (to its predicted class) and one FN (to its true class), so when we sum over all classes the FP and FN totals are equal. That is why, under micro averaging, precision and recall are identical; in fact, accuracy is the same value as well
- after all, accuracy is just the number of correct predictions (the diagonal) divided by the total count in the confusion matrix
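This claim can be verified directly with sklearn; the labels below are made up for illustration:

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical multi-class labels (6 of the 9 predictions are correct)
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0]
y_pred = [0, 2, 2, 1, 1, 0, 1, 2, 2]

micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

# All four are the fraction of correctly classified samples
print(micro_p, micro_r, micro_f1, acc)
```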