Pandas corr() returning NaN too often
Problem description
I'm attempting to run what I think should be a simple correlation function on a dataframe, but it returns NaN in places where I don't believe it should.
Code:
# setup
import pandas as pd
import io
csv = io.StringIO(u'''
id date num
A 2018-08-01 99
A 2018-08-02 50
A 2018-08-03 100
A 2018-08-04 100
A 2018-08-05 100
B 2018-07-31 500
B 2018-08-01 100
B 2018-08-02 100
B 2018-08-03 0
B 2018-08-05 100
B 2018-08-06 500
B 2018-08-07 500
B 2018-08-08 100
C 2018-08-01 100
C 2018-08-02 50
C 2018-08-03 100
C 2018-08-06 300
''')
df = pd.read_csv(csv, sep = ' ')
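# Format manipulation: drop small values, then pivot to one column per id
df = df[df['num'] > 50]
df = df.pivot(index='date', columns='id', values='num')
df = pd.DataFrame(df.to_records())  # turn the 'date' index back into a regular column
# Main correlation calculations (skip the leading 'date' column)
print(df.iloc[:, 1:].corr())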
Subject dataframe:
       A      B      C
0    NaN  500.0    NaN
1   99.0  100.0  100.0
2    NaN  100.0    NaN
3  100.0    NaN  100.0
4  100.0    NaN    NaN
5  100.0  100.0    NaN
6    NaN  500.0  300.0
7    NaN  500.0    NaN
8    NaN  100.0    NaN
corr() results:
     A    B    C
A  1.0  NaN  NaN
B  NaN  1.0  1.0
C  NaN  1.0  1.0
According to the (limited) documentation on the function, it should exclude "NA/null values". Since there are overlapping values for each pair of columns, shouldn't every result be non-NaN?
There are good discussions here and here, but neither answered my question. I've tried the float64 idea discussed here, but that failed as well.
@hellpanderr's comment brought up a good point: I'm using pandas 0.22.0.
Bonus question: I'm no mathematician, but how is there a 1:1 correlation between B and C in this result?
Recommended answer
The result seems to be an artefact of the data you work with. As you write, NAs are ignored: for each pair of columns, only the rows where both columns have a value are used. For B and C it basically boils down to:
df[['B', 'C']].dropna()
       B      C
1  100.0  100.0
6  500.0  300.0
So only two values per column are left for the calculation. Two distinct points always lie exactly on a straight line, which therefore leads to a correlation coefficient of 1:
df[['B', 'C']].dropna().corr()
     B    C
B  1.0  1.0
C  1.0  1.0
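To make the "two points always correlate perfectly" step concrete, here is a minimal sketch that recomputes Pearson's r by hand for the two surviving B/C rows. The names `pair`, `flipped`, `r_manual`, `r_pandas` and `r_flipped` are made up for this illustration; the values are the ones shown in the output above.

```python
import pandas as pd

# The two rows where both B and C are present (copied from the output above).
pair = pd.DataFrame({"B": [100.0, 500.0], "C": [100.0, 300.0]})

# Pearson r = cov(B, C) / (std(B) * std(C)), using sample statistics.
r_manual = pair["B"].cov(pair["C"]) / (pair["B"].std() * pair["C"].std())
r_pandas = pair["B"].corr(pair["C"])

# Reversing one column only flips the sign: two points still sit on a line.
flipped = pd.DataFrame({"B": [100.0, 500.0], "C": [300.0, 100.0]})
r_flipped = flipped["B"].corr(flipped["C"])

print(r_manual, r_pandas, r_flipped)
```

With only two overlapping observations the coefficient can only be +1, -1 or undefined, so the 1.0 between B and C says nothing about how the full series relate.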
So where do the NAs for the remaining combinations come from?
df[['A', 'B']].dropna()
       A      B
1   99.0  100.0
5  100.0  100.0
df[['A', 'C']].dropna()
       A      C
1   99.0  100.0
3  100.0  100.0
Here, too, you end up with only two values per column. The difference is that B (in the first pair) and C (in the second pair) each contain only a single distinct value (100), which gives a standard deviation of 0:
df[['A', 'C']].dropna().std()
A    0.707107
C    0.000000
When the correlation coefficient is calculated, the covariance is divided by the product of the standard deviations. With a standard deviation of 0 this is a division by zero, which yields NaN.
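The whole diagnosis can be repeated for every pair of columns in one go. The following is only a sketch: the frame is retyped by hand from the output in the question (an assumption, not the asker's actual pipeline), and the variable names `corr`, `report` and `strict` are invented for illustration. It also uses the `min_periods` argument of `DataFrame.corr()`, which is not mentioned in the original answer, to demand a minimum number of overlapping rows.

```python
import itertools

import numpy as np
import pandas as pd

nan = np.nan
# Retyped from the "subject dataframe" in the question (assumption).
df = pd.DataFrame({
    "A": [nan, 99, nan, 100, 100, 100, nan, nan, nan],
    "B": [500, 100, 100, nan, nan, 100, 500, 500, 100],
    "C": [nan, 100, nan, 100, nan, nan, 300, nan, nan],
})

corr = df.corr()

# For each pair: number of overlapping rows and the std of each column there.
report = {}
for x, y in itertools.combinations(df.columns, 2):
    both = df[[x, y]].dropna()
    stds = both.std()
    report[(x, y)] = (len(both), stds[x], stds[y], corr.loc[x, y])
    print(x, y, "n =", len(both), "std =", stds[x], stds[y], "r =", corr.loc[x, y])

# Require at least 3 overlapping rows before reporting a coefficient.
strict = df.corr(min_periods=3)
print(strict)
```

Every pair overlaps in just two rows, and a zero standard deviation in either column of a pair goes with a NaN coefficient. Setting `min_periods` is one possible way to have such thin overlaps reported as NaN instead of a misleading 1.0; the threshold of 3 here is arbitrary.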