分层数据:有效地为每个节点构建一个每个后代的-Python问题

Hierarchical data: efficiently build a list of every descendant for each node(分层数据:有效地为每个节点构建一个每个后代的列表)

本文介绍了分层数据:有效地为每个节点构建一个每个后代的列表的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个两列数据集，描述了形成一棵大树的多个子-父关系.我想用它来为每个节点构建每个后代的更新列表.

I have a two column data set depicting multiple child-parent relationships that form a large tree. I would like to use this to build an updated list of every descendant for each node.

   child  parent
1   2010    1000
7   2100    1000
5   2110    1000
3   3000    2110
2   3011    2010
4   3033    2100
0   3102    2010
6   3111    2110

关系的图形描述:

    descendant  ancestor
0         2010      1000
1         2100      1000
2         2110      1000
3         3000      1000
4         3011      1000
5         3033      1000
6         3102      1000
7         3111      1000
8         3011      2010
9         3102      2010
10        3033      2100
11        3000      2110
12        3111      2110

最初我决定对 DataFrames 使用递归解决方案.它按预期工作，但 Pandas 效率极低.我的研究使我相信，使用 NumPy 数组(或其他简单数据结构)的实现在大型数据集(数十万条记录)上会快得多.

Originally I decided to use a recursive solution with DataFrames. It works as intended, but Pandas is awfully inefficient. My research has led me to believe that an implementation using NumPy arrays (or other simple data structures) would be much faster on large data sets (of 10's of thousands of records).

import pandas as pd

df = pd.DataFrame(
    {
        'child':     [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent':    [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    },  columns=['child', 'parent']
)


def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):

        list_of_children = list()
        list_of_children.append(df[df['parent'] == parent_id]['child'].values)

        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))

        # flatten list
        list_of_children = [item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df
        .drop_duplicates()
        .sort_values(['ancestor', 'descendant'])
        .reset_index(drop=True)

    return new_df

因为以这种方式使用pandas DataFrames在大数据集上效率很低，我需要提高这个操作的性能.我的理解是可以通过使用更高效的数据结构更好地做到这一点适用于循环和递归.我想以最有效的方式执行相同的操作.

Because using pandas DataFrames in this way is very inefficient on large data sets, I need to improve the performance of this operation. My understanding is that this can be done by using more efficient data structures better suited for looping and recursion. I want to perform this same operation in the most efficient way possible.

具体来说，我要求优化速度.

推荐答案

这是一种使用 numpy 一次向下迭代树的方法.

This is a method using numpy to iterate down the tree a generation at a time.

import numpy as np
import pandas as pd  # only used to return a dataframe


def list_ancestors(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Return pandas dataframe of all descendant/ancestor node pairs

    Ex:
        df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        list_ancestors(df.values)

        returns

            descendant  ancestor
        0          200       100
        1          201       100
        2          300       200
        3          300       100
        4          301       200
        5          301       100
        6          302       201
        7          302       100
        8          400       300
        9          400       200
        10         400       100
    """
    ancestors = []
    for ar in trace_nodes(edges):
        ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1),
                               ar[:, 1:].flatten()])
    return pd.DataFrame(np.concatenate(ancestors),
                        columns=['descendant', 'ancestor'])


def trace_nodes(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Yield numpy array with cross-section of tree and associated
    ancestor nodes

    Ex:
        df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        trace_nodes(df.values)

        yields

        array([[200, 100],
               [201, 100]])

        array([[300, 200, 100],
               [301, 200, 100],
               [302, 201, 100]])

        array([[400, 300, 200, 100]])
    """
    mask = np.in1d(edges[:, 1], edges[:, 0])
    gen_branches = edges[~mask]
    edges = edges[mask]
    yield gen_branches
    while edges.size != 0:
        mask = np.in1d(edges[:, 1], edges[:, 0])
        next_gen = edges[~mask]
        gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
        edges = edges[mask]
        yield gen_branches


def numpy_col_inner_many_to_one_join(ar1, ar2):
    """
    Take two 2-d numpy arrays ar1 and ar2,
    with no duplicate values in first column of ar2
    Return inner join of ar1 and ar2 on
    last column of ar1, first column of ar2

    Ex:

        ar1 = np.array([[1,  2,  3],
                        [4,  5,  3],
                        [6,  7,  8],
                        [9, 10, 11]])

        ar2 = np.array([[ 1,  2],
                        [ 3,  4],
                        [ 5,  6],
                        [ 7,  8],
                        [ 9, 10],
                        [11, 12]])

        numpy_col_inner_many_to_one_join(ar1, ar2)

        returns

        array([[ 1,  2,  3,  4],
               [ 4,  5,  3,  4],
               [ 9, 10, 11, 12]])
    """
    ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
    ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
    if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
        bins = np.bincount(ar1[:, -1])
        counts = bins[bins.nonzero()[0]]
    else:
        counts = np.unique(ar1[:, -1], False, False, True)[1]
    left = ar1[ar1[:, -1].argsort()]
    right = ar2[ar2[:, 0].argsort()]
    return np.concatenate([left[:, :-1],
                           right[np.repeat(np.arange(right.shape[0]),
                                           counts)]], 1)

时序比较:

测试用例 1 &2 由@taky2 提供，测试用例 3 &4 分别比较高树结构和宽树结构的性能——大多数用例可能介于中间.

Timing Comparison:

Test cases 1 & 2 provided by @taky2, test cases 3 & 4 comparing performance on tall and wide tree structures respectively – most use cases are likely somewhere in the middle.

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }
)

df2 = pd.DataFrame(
    {
        'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
                  3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
        'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
                   2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
    }
)

df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
                         np.c_[np.arange(501, 1001), np.arange(500)]],
                   columns=['child', 'parent'])

df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
                         np.c_[np.arange(1001, 11001),
                               np.repeat(np.arange(1, 101), 100)]],
                   columns=['child', 'parent'])

%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop

%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop

%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 µs per loop

%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 µs per loop

%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop

%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop

%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop

%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 µs per loop

%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop

%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop

%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop

%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop

%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop

%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop

注意事项:

get_ancestry_dataframe_flat 未在案例 3 上计时 &4 由于时间和内存问题.

get_ancestry_dataframe_flat not timed on cases 3 & 4 due to time and memory concerns.

add_children_of_children 修改为在内部识别根节点，但允许假设一个唯一的根.第一行 root_node = (set(dataframe.parent) - set(dataframe.child)).pop() 添加.

add_children_of_children modified to identify root node internally, but allowed to assume a unique root. First line root_node = (set(dataframe.parent) - set(dataframe.child)).pop() added.

all_descendants_nx 修改为接受数据帧作为参数，而不是从外部命名空间中提取.

all_descendants_nx modified to accept a dataframe as an argument, instead of pulling from an external namespace.

展示正确行为的示例:

np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])
                                       .reset_index(drop=True) ==
       list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])
                                 .reset_index(drop=True))
Out[20]: True

这篇关于分层数据:有效地为每个节点构建一个每个后代的列表的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持编程学习网！