How to sum in pandas by unique index in several columns?(如何通过几列中的唯一索引对 pandas 求和?)
问题描述
我有一个 pandas DataFrame,它详细说明了用户会话期间的点击"方面的在线活动.有多达 50,000 个独立用户,数据框有大约 150 万个样本.显然大多数用户都有多条记录.
I have a pandas DataFrame which details online activities in terms of "clicks" during an user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 million samples. Obviously most users have multiple records.
四列是唯一的用户id,用户开始服务Registration"的日期,用户使用服务Session"的日期,总点击次数.
The four columns are a unique user id, the date when the user began the service "Registration", the date the user used the service "Session", the total number of clicks.
dataframe的组织结构如下:
The organization of the dataframe is as follows:
User_ID Registration Session clicks
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
...
(上面还有一个以0开头的索引,但可以将User_ID
设置为索引.)
(There is also an index above beginning with 0, but one could set User_ID
as the index.)
我想汇总用户自注册日期以来的总点击次数.数据框(或 pandas Series 对象)将列出 User_ID 和Total_Number_Clicks".
I would like to aggregate the total number of clicks by the user since Registration date. The dataframe (or pandas Series object) would list User_ID and "Total_Number_Clicks".
User_ID Total_Clicks
2349876 722
1987293 341
2234214 220
9874452 1405
...
如何在 pandas 中做到这一点?这是由 .agg()
完成的吗?每个 User_ID
都需要单独求和.
How does one do this in pandas? Is this done by .agg()
? Each User_ID
needs to be summed individually.
由于有 150 万条记录,这是否可以扩展?
As there are 1.5 million records, does this scale?
推荐答案
IIUC你可以使用groupby
, sum
和 reset_index
:
IIUC you can use groupby
, sum
and reset_index
:
print df
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print df.groupby('User_ID')['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
如果第一列User_ID
是index
:
print df
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
print df.groupby(level=0)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
或者:
print df.groupby(df.index)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
正如 Alexander 所指出的,您需要在 groupby
之前过滤数据,如果 Session
日期少于每个 User_ID
的 Registration
日期:
As Alexander pointed, you need filter data before groupby
, if Session
dates is less as Registration
dates per User_ID
:
print df
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
我更改了 3. 行数据以获得更好的样本:
I change 3. row of data for better sample:
print df
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2012-01-22 7
9874452 2010-12-22 2014-08-22 2
print df.Session >= df.Registration
User_ID
2349876 True
1987293 True
2234214 False
9874452 True
dtype: bool
print df[df.Session >= df.Registration]
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
9874452 2010-12-22 2014-08-22 2
df1 = df[df.Session >= df.Registration]
print df1.groupby(df1.index)['clicks'].sum().reset_index()
User_ID clicks
0 1987293 1
1 2349876 2
2 9874452 2
这篇关于如何通过几列中的唯一索引对 pandas 求和?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:如何通过几列中的唯一索引对 pandas 求和?
基础教程推荐
- 如何在 Python 中检测文件是否为二进制(非文本)文 2022-01-01
- 将 YAML 文件转换为 python dict 2022-01-01
- Python 的 List 是如何实现的? 2022-01-01
- 使 Python 脚本在 Windows 上运行而不指定“.py";延期 2022-01-01
- 如何在Python中绘制多元函数? 2022-01-01
- 症状类型错误:无法确定关系的真值 2022-01-01
- 使用Python匹配Stata加权xtil命令的确定方法? 2022-01-01
- 使用 Google App Engine (Python) 将文件上传到 Google Cloud Storage 2022-01-01
- 哪些 Python 包提供独立的事件系统? 2022-01-01
- 合并具有多索引的两个数据帧 2022-01-01