Preface
When evaluating how good a model is, we usually want to look at other evaluation metrics besides accuracy.
sklearn provides a very convenient function for quickly getting a set of quantitative metrics for a model, as in the code below:
from sklearn.metrics import classification_report
which gives a report like the one shown below:
How exactly should we read this report? We can tell that macro and weighted are different averaging methods, so the last two rows must be averaging each metric in different ways. But then why does accuracy sit on the third-to-last row, with no precision or recall of its own?
- Spoiler: the accuracy row is actually doing a micro avg
- under micro averaging, precision, recall, and f1-score are all identical
The details are explained below.
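As a minimal sketch of how the function is typically called (the `y_true` and `y_pred` labels here are made up for illustration):

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth and predicted labels for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 2, 2, 1, 1, 0, 1, 2, 2, 1]

# Prints per-class precision/recall/f1-score/support, plus the
# accuracy, macro avg, and weighted avg rows discussed above
print(classification_report(y_true, y_pred))
```

Passing `output_dict=True` instead returns the same numbers as a dictionary, which is handy for logging.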
F1-score
In general, we prefer classifiers with higher precision and recall scores. However, there is a trade-off between precision and recall: when tuning a classifier, improving the precision score often results in lowering the recall score and vice versa — there is no free lunch.
There is usually a trade-off between precision and recall: improving one tends to lower the other
- no free lunch theorem
Want a single metric that measures precision and recall at the same time? The F1-score uses the harmonic mean:
F1-score = 2 × (precision × recall)/(precision + recall)
The harmonic mean gives more weight to the smaller value, as the following example shows:
Similar to the arithmetic mean, the F1-score will always be somewhere in between precision and recall. But it behaves differently: the F1-score gives a larger weight to lower numbers. For example, when precision is 100% and recall is 0%, the F1-score will be 0%, not 50%. Or say that Classifier A has precision = recall = 80%, and Classifier B has precision = 60%, recall = 100%. Arithmetically, the mean of precision and recall is the same for both models. But when we use F1's harmonic mean formula, the score for Classifier A will be 80%, while for Classifier B it will be only 75%. Classifier B's low precision score pulled down its F1-score.
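The Classifier A vs. Classifier B comparison above can be checked with a few lines of arithmetic:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A: precision = recall = 80%  -> F1 = 80%
# Classifier B: precision = 60%, recall = 100% -> F1 = 75%
# Degenerate case: precision = 100%, recall = 0% -> F1 = 0%, not 50%
print(f1(0.80, 0.80))
print(f1(0.60, 1.00))
print(f1(1.00, 0.00))
```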
F1-score in multi-class
In a multi-class task, each class gets its own f1-score. How do we turn them into a single overall f1-score? There are several averaging methods. Suppose we have three classes whose F1-scores are:
- 42.1%
- 30.8%
- 66.7%
macro
The arithmetic mean of the per-class f1-scores, i.e. their sum divided by the number of classes:
Macro-F1 = (42.1% + 30.8% + 66.7%) / 3 = 46.5%
macro-f1 is unweighted: every class carries the same weight
- in other words, under an imbalanced distribution, a small class influences the overall score just as much as a large one
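The same arithmetic as a short snippet, using the per-class F1-scores from the example:

```python
# Per-class F1-scores from the example above
f1_scores = [0.421, 0.308, 0.667]

# Macro-F1: plain arithmetic mean, every class weighted equally
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"{macro_f1:.1%}")  # 46.5%
```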
weighted
Suppose we have 25 samples, with class counts of 6, 10, and 9.
Then Weighted-F1 is computed as:
Weighted-F1 = (6 × 42.1% + 10 × 30.8% + 9 × 66.7%) / 25 = 46.4%
Each class's f1 is weighted by that class's sample count
- in other words, under an imbalanced distribution, the large classes dominate the overall f1
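And the weighted version, using the same F1-scores together with the class counts above:

```python
# Per-class F1-scores and per-class sample counts (support) from the example
f1_scores = [0.421, 0.308, 0.667]
supports = [6, 10, 9]  # 25 samples in total

# Weighted-F1: each class's F1 weighted by its number of samples
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)
print(f"{weighted_f1:.1%}")  # 46.4%
```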
micro
First compute the overall precision and recall across all classes, then compute micro-f1 from them
precision = TP/(TP+FP)
recall = TP/(TP+FN)
Consider the confusion matrix in the multi-class case:
- TP is the sum of the diagonal entries (samples classified correctly)
- FP and FN come from the off-diagonal entries. A count in cell (A, B) of the confusion matrix, i.e. a sample of true class A predicted as class B, can be read two ways:
    - from the point of view of class A (the true class), it is a false negative (FN)
    - from the point of view of class B (the predicted class), it is a false positive (FP)
Every misclassified sample therefore contributes exactly one FP (to its predicted class) and one FN (to its true class), so when we sum over all classes the FP and FN totals are equal. That is why, under micro averaging, precision and recall are identical; in fact, accuracy is the same value as well
- after all, accuracy is just the number of correct predictions (the diagonal) divided by the total count in the confusion matrix
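This claim can be verified directly with sklearn; the labels below are made up for illustration:

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical multi-class labels (6 of the 9 predictions are correct)
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0]
y_pred = [0, 2, 2, 1, 1, 0, 1, 2, 2]

micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

# All four are the fraction of correctly classified samples
print(micro_p, micro_r, micro_f1, acc)
```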