Why does precision_recall_curve() return different values than the confusion matrix?

Problem description

I wrote the following code to compute precision and recall for a multi-class classification problem:

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(
    svm.SVC(kernel="linear", probability=True, random_state=random_state)
)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

# Confusion matrix
from sklearn.metrics import classification_report
y_test_pred =  classifier.predict(X_test)
print(classification_report(y_test, y_test_pred))

# Compute the precision-recall curve for each class
precision = dict()
recall = dict()
threshold = dict()
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest(threshold[c], 0)
    print(c, round(precision[c][th0],2), round(recall[c][th0], 2))

What I am trying to do is recompute the precision and recall shown in the confusion matrix report:

              precision    recall  f1-score   support

           0       0.73      0.52      0.61        21
           1       1.00      0.07      0.12        30
           2       0.57      0.33      0.42        24

   micro avg       0.68      0.28      0.40        75
   macro avg       0.77      0.31      0.39        75
weighted avg       0.79      0.28      0.36        75
 samples avg       0.28      0.28      0.28        75

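I want to obtain these values with the precision_recall_curve() function. In theory, it should return exactly the same results as the confusion matrix when the threshold equals 0. However, my results do not match the report: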

   precision  recall
0       0.75    0.57
1        1.0     0.1
2        0.6    0.38

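Could you explain this difference, and how to correctly compute the values reported by the confusion matrix?

Recommended answer

As I wrote in the comments, using index th0 + 1 instead of index th0 solves your problem. However, this may be specific to this case (in this particular example, the threshold closest to 0 always corresponds to a negative score). For a programmatic approach, you should therefore modify find_nearest so that it returns the index of the threshold that is positive and closest to 0. You can see what happens by adding the following line to the loop: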

print(th0, threshold[c][th0-1], threshold[c][th0], threshold[c][th0+1])

This produces the following output (for each class: th0, then the thresholds at th0 - 1, th0 and th0 + 1):

20 -0.011161920989200713 -0.01053513227868108 0.016453546101096173
67 -0.04226738229343663 -0.0074193008862454835 0.09194626401603534
38 -0.011860865951094923 -0.003756310149749531 0.0076752136658660985

For a more programmatic approach, you can simply modify find_nearest as follows and keep index th0 in your loop.

def find_nearest_new(array, value):
    array = np.asarray(array)
    # Mask non-positive thresholds with inf so that only positive ones can be selected
    idx = (np.abs(np.where(array > 0, array, np.inf) - value)).argmin()
    return idx
...
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest_new(threshold[c], 0)
    print(c, round(precision[c][th0],6), round(recall[c][th0], 6), round(threshold[c][th0],6))
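
Since the thresholds returned by precision_recall_curve are documented as increasing, the same index can also be found with a binary search rather than by masking. The following is only a minimal sketch of that idea: the helper name first_positive_index is hypothetical, and the two outer values in the sample array are made up, while the three middle values are the class 0 thresholds printed earlier.

```python
import numpy as np

def first_positive_index(thresholds):
    """Return the index of the smallest threshold strictly greater than 0.

    Assumes `thresholds` is sorted in increasing order, which is how
    precision_recall_curve documents its `thresholds` output.
    """
    thresholds = np.asarray(thresholds)
    # side="right" skips entries equal to 0 and lands on the first value > 0.
    idx = int(np.searchsorted(thresholds, 0.0, side="right"))
    if idx == len(thresholds):
        raise ValueError("no positive threshold found")
    return idx

# Middle three values: class 0 thresholds from the printed output.
# First and last values: invented padding, for illustration only.
sample = np.array([-0.5, -0.011161920989200713, -0.01053513227868108,
                   0.016453546101096173, 0.7])
idx = first_positive_index(sample)
print(idx, sample[idx])
```

Under these assumptions the helper could replace find_nearest_new(threshold[c], 0) in the loop. Unlike the masking version, it raises an error when no positive threshold exists instead of silently returning index 0.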

My hint came from the fact that, in the precision_recall_curve implementation, precision and recall are defined as follows:

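precision: ndarray of shape (n_thresholds + 1,)
Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

recall: ndarray of shape (n_thresholds + 1,)
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

In other words, if you sort the scores in descending order (as the implementation does), you will see that the threshold selected at index th0 + 1 coincides with the smallest positive score of each class (the thresholds are simply the distinct score values). If, on the other hand, you stick with index th0 (in this particular example), you get a threshold strictly less than 0, so samples with slightly negative scores are also counted as positive predictions.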

for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest(threshold[c], 0)
    print(c, round(precision[c][th0+1],6), round(recall[c][th0+1], 6), round(threshold[c][th0+1],6))
    #print(c, precision[c], recall[c], threshold[c])
    print(np.sort(y_score[:,c])[::-1])
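
To make the effect of the score >= thresholds[i] rule concrete, here is a small hand-computed sketch that does not call scikit-learn at all. The labels and scores are invented for illustration. It also assumes that the classifier's predictions amount to thresholding the decision function at 0 (score > 0), which is the premise of the question.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

# Made-up labels and decision scores for a single class (illustration only).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([-1.2, -0.4, -0.01, 0.02, 0.3, 0.8, -0.9, 1.5])

# Reference values: predictions obtained by thresholding the scores at 0.
reference = precision_recall(y_true, (scores > 0).astype(int))

# Candidate thresholds are the distinct score values, in increasing order.
thresholds = np.unique(scores)
nearest = int(np.abs(thresholds).argmin())                       # closest to 0, here negative
first_pos = int(np.searchsorted(thresholds, 0.0, side="right"))  # smallest positive

# Apply the "score >= threshold" rule at both candidate indices.
at_nearest = precision_recall(y_true, (scores >= thresholds[nearest]).astype(int))
at_first_pos = precision_recall(y_true, (scores >= thresholds[first_pos]).astype(int))

print("reference       ", reference)
print("nearest to 0    ", thresholds[nearest], at_nearest)
print("smallest positive", thresholds[first_pos], at_first_pos)
```

In this toy data the threshold nearest to 0 is -0.01, so the sample scored -0.01 is counted as a positive prediction and both precision and recall shift. The smallest positive threshold reproduces the reference values, which mirrors the th0 versus th0 + 1 behavior described above.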

This post might help in understanding how precision_recall_curve() works.
