Processing a large .txt file in python efficiently


Problem description

I am quite new to python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (I've had it running now for about 40 hours). The math is quite simple, so I don't think it should be taking this long.

The way I am reading my .txt file right now is with the csv module's DictReader class. My code is as follows:

import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
# strip NUL bytes from each line before the csv module parses it
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of specified lines?

Recommended answer

A collections.deque is an ordered collection of items that can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" of your csv, you just need to keep adding rows to the deque, and it will handle discarding the old ones for you.
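For example, a minimal sketch of the maxlen behaviour on its own: appending past capacity silently drops the oldest item.

```python
from collections import deque

# a deque with maxlen=3 keeps only the 3 most recent items;
# each append past capacity evicts the oldest element
window = deque(maxlen=3)
for n in range(5):
    window.append(n)
print(list(window))  # [2, 3, 4]
```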

import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    # strip NUL bytes from each line before the csv module parses it
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: load the first 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))

    # compute, then slide the window forward 10,000 rows at a time
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        # run once more on the final, possibly partial, window
        compute(dq)
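If you prefer to advance the reader a whole batch at a time instead of row by row, itertools.islice can pull each step's worth of rows in one call. The following is a sketch of that variation, not the answerer's code; the sliding_windows name and its window/step parameters are hypothetical, and it assumes Python 3 and a tab-delimited file with a header row:

```python
import collections
import csv
from itertools import islice

def sliding_windows(path, window=50000, step=10000):
    # Hypothetical helper: yields the deque each time it covers a
    # full window, plus once more for the final (possibly partial) one.
    dq = collections.deque(maxlen=window)
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # initial fill: up to `window` rows
        dq.extend(islice(reader, window))
        yield dq
        while True:
            batch = list(islice(reader, step))
            if not batch:
                break
            dq.extend(batch)
            yield dq
```

Note that every yield hands back the same deque, mutated in place, so run your calculation on it immediately rather than collecting the windows for later.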
