Inter-rater reliability calculation for multi-rater data
I have the following list of lists:
[[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 2, 0, 0, 1],
[1, 1, 0, 2, 3, 1, 0, 1]]
I want to calculate an inter-rater reliability score; there are multiple raters (rows). I cannot use Fleiss' kappa, since the rows do not sum to the same number. What is a good approach in this case?
Yes, data preparation is key here. Let's walk through it together.
While Krippendorff's alpha may be superior for any number of reasons, numpy and statsmodels provide everything you need to get Fleiss' kappa from the table above. Fleiss' kappa is more prevalent in medical research, even though Krippendorff's alpha delivers mostly the same result when used correctly. If they deliver substantially different results, this is usually due to user error, most importantly the format of the input data and the level of measurement (e.g. ordinal vs. nominal). Skip ahead for the solution (transpose & aggregate): Fleiss' kappa 0.845.
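For the impatient, here is that recipe condensed into a few lines (a compact sketch of the exact steps walked through below):
import numpy as np
from statsmodels.stats import inter_rater as irr
ratings = np.array([[1, 1, 1, 1, 3, 0, 0, 1],
                    [1, 1, 1, 1, 3, 0, 0, 1],
                    [1, 1, 1, 1, 2, 0, 0, 1]])   # raters as rows, subjects as columns
table, cats = irr.aggregate_raters(ratings.T)    # transpose to (subject, rater), then aggregate
irr.fleiss_kappa(table, method='fleiss')         # 0.845...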
Pay close attention to which axis represents subjects, raters, or categories!
Fleiss' kappa
import numpy as np
from statsmodels.stats import inter_rater as irr
The original data had raters as rows and subjects as columns with the integers representing the assigned categories (if I'm not mistaken).
I removed one row because there were 4 rows and 4 categories which may confuse the situation – so now we have 4 [0,1,2,3] categories and 3 rows.
orig = [[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 2, 0, 0, 1]]
From the documentation of the aggregate_raters() function:
"convert raw data with shape (subject, rater) to (subject, cat_counts)"
irr.aggregate_raters(orig)
This returns:
(array([[2, 5, 0, 1], [2, 5, 0, 1], [2, 5, 1, 0]]), array([0, 1, 2, 3]))
Now… the number of rows in the orig array equals the number of rows in the first returned array (3). The number of columns now equals the number of categories ([0,1,2,3] -> 4). The contents of each row add up to 8, which equals the number of columns in the orig input data – assuming every rater rated every subject. This aggregation shows how the raters are distributed across the categories (columns) for each subject (row). (If agreement were perfect on category 2 we would see [0,0,8,0]; on category 0, [8,0,0,0].)
The function expects the rows to be subjects. Notice that the number of subjects has not changed (3 rows). For each subject it counts how many times each category was assigned by 'looking' at how often that category (number) appears in the row. For the first row, category 0 was assigned twice, 1 five times, 2 never, and 3 once:
[1, 1, 1, 1, 3, 0, 0, 1] -> [2, 5, 0, 1]
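As a quick cross-check (my own sketch, not part of statsmodels, and assuming the categories are exactly the consecutive integers 0–3), numpy's bincount reproduces the same per-row counts:
np.array([np.bincount(row, minlength=4) for row in orig])
array([[2, 5, 0, 1], [2, 5, 0, 1], [2, 5, 1, 0]])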
The second array returns the category values. If we replace both 3s in the input array with 9s the distribution looks the same but the last category has changed.
ori9 = [[1, 1, 1, 1, 9, 0, 0, 1],
[1, 1, 1, 1, 9, 0, 0, 1],
[1, 1, 1, 1, 2, 0, 0, 1]]
(array([[2, 5, 0, 1], [2, 5, 0, 1], [2, 5, 1, 0]]), array([0, 1, 2, 9])) <- categories
aggregate_raters() returns a tuple of ([data], [categories])
In the [data] the rows stay subjects; aggregate_raters() turns the columns from raters into categories. fleiss_kappa() expects the 'table' data to be in this (subject, category) format: https://en.wikipedia.org/wiki/Fleiss'_kappa#Data
Now to the solution of the problem:
What happens if we plug the original data into Fleiss' kappa? (We only use the data 'dats', not the category list 'cats'.)
dats, cats = irr.aggregate_raters(orig)
irr.fleiss_kappa(dats, method='fleiss')
-0.12811059907834096
But... why? Well, look at the orig data – aggregate_raters() assumes raters are columns! This means we have perfect disagreement, e.g. between the first column and the second-to-last column – Fleiss thinks: "the first rater always rated 1 and the second-to-last always rated 0" -> perfect disagreement on all three subjects.
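You can make that 'disagreement' visible by pulling out those two columns directly (a small illustrative check):
np.array(orig)[:, 0]   # array([1, 1, 1]) – read as one rater who always assigns 1
np.array(orig)[:, -2]  # array([0, 0, 0]) – read as another rater who always assigns 0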
So what we need to do is (sorry I'm a noob – might not be the most elegant):
giro = np.array(orig).transpose()
giro
array([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 2], [0, 0, 0], [0, 0, 0], [1, 1, 1]])
Now we have subjects as rows and raters as columns (three raters assigning 4 categories). What happens if we plug this into the aggregate_raters() function and feed the resulting data into fleiss_kappa()? (Using index 0 to grab the first part of the returned tuple.)
irr.fleiss_kappa(irr.aggregate_raters(giro)[0], method='fleiss')
0.8451612903225807
Finally… this makes more sense: all three raters agreed perfectly except on subject 5 [3, 3, 2].
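As a sanity check (my own sketch, not needed for the solution), applying the standard Fleiss formula by hand to the aggregated counts table gives the same value:
table = irr.aggregate_raters(giro)[0].astype(float)        # shape (8 subjects, 4 categories)
n = table.sum(axis=1)[0]                                   # raters per subject (3, constant here)
P_i = (np.square(table).sum(axis=1) - n) / (n * (n - 1))   # per-subject agreement
P_bar = P_i.mean()                                         # mean observed agreement
p_j = table.sum(axis=0) / table.sum()                      # overall category proportions
P_e = np.square(p_j).sum()                                 # expected agreement by chance
(P_bar - P_e) / (1 - P_e)                                  # 0.8451612903225807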
Krippendorff's alpha
The current krippendorff implementation expects the data in the orig format, with raters as rows and subjects as columns – no aggregation function is needed to prepare the data, so I can see how this would be the simpler solution. Fleiss' kappa is still very prevalent in medical research, so let's see how it compares:
import krippendorff as kd
kd.alpha(orig)
0.9359
Wow… that's a lot higher than Fleiss' kappa... Well, we need to tell Krippendorff the level of measurement of the variable – the docstring asks for "Steven's level of measurement of the variable. It must be one of 'nominal', 'ordinal', 'interval', 'ratio' or a callable." – this is used for the 'difference function' of Krippendorff's alpha, and the default is not 'nominal', which explains the higher value. https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers
kd.alpha(orig, level_of_measurement='nominal')
0.8516
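To see how much this choice matters for the example data, you can compare the levels directly (a small illustrative loop; the exact values depend on the chosen difference function):
for level in ('nominal', 'ordinal', 'interval'):
    print(level, kd.alpha(orig, level_of_measurement=level))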
Hope this helps, I learned a lot writing this.