How to unnest (explode) a column in a pandas DataFrame, into multiple rows(如何将 pandas DataFrame 中的列取消嵌套(分解)成多行)
问题描述
I have the following DataFrame where one of the columns is an object (list type cell):
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]
My expected output is:
A B
0 1 1
1 1 2
3 2 1
4 2 2
What should I do to achieve this?
Related question
pandas: When cell contents are lists, create a row for each element in the list
Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply
, which is not recommended, check more info When should I ever want to use pandas apply() in my code?)
I know object
dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.
I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode
function:
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
Given a dataframe with an empty list
or a NaN
in the column. An empty list will not cause an issue, but a NaN
will need to be filled with a list
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1
apply + pd.Series
(easy to understand but in terms of performance not recommended . )
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat
with DataFrame
constructor , re-create your dataframe (good at performance, not good at multiple columns )
df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .
Solution : join
or merge
with the index
after 'unnest' the single columns
s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex
at the end.
s.join(df.drop('B',1),how='left').reindex(columns=df.columns)
Method 3
recreate the list
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
using reindex
or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5 when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
using numpy
for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
using base function itertools
cycle
and chain
: Pure python solution just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-def function:
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame
constructor
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
if axis==1:
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
Test Output
unnesting(df, ['B','C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
Update 2021-02-17 with original explode function
def unnesting(df, explode, axis):
if axis==1:
df1 = pd.concat([df[x].explode() for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
这篇关于如何将 pandas DataFrame 中的列取消嵌套(分解)成多行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:如何将 pandas DataFrame 中的列取消嵌套(分解)成多行
基础教程推荐
- 合并具有多索引的两个数据帧 2022-01-01
- 如何在 Python 中检测文件是否为二进制(非文本)文 2022-01-01
- Python 的 List 是如何实现的? 2022-01-01
- 使用 Google App Engine (Python) 将文件上传到 Google Cloud Storage 2022-01-01
- 症状类型错误:无法确定关系的真值 2022-01-01
- 如何在Python中绘制多元函数? 2022-01-01
- 使 Python 脚本在 Windows 上运行而不指定“.py";延期 2022-01-01
- 哪些 Python 包提供独立的事件系统? 2022-01-01
- 将 YAML 文件转换为 python dict 2022-01-01
- 使用Python匹配Stata加权xtil命令的确定方法? 2022-01-01