Iterate over 36 million items in a list of tuples in Python efficiently and faster
Problem description
Firstly, before anyone marks this as a duplicate, please read below. I am unsure whether the delay in the iteration is due to the huge size or to my logic. I have a use case where I have to iterate over 36 million items in a list of tuples. My main requirements are speed and efficiency. Sample list:
[
('how are you', 'I am fine'),
('how are you', 'I am not fine'),
...36 million items...
]
What I have done so far:
import numpy as np
from ast import literal_eval
from operator import itemgetter
from nltk.tokenize import word_tokenize
from scipy.spatial import distance

store_question_score = []
count = 1
result_dict = {}

for query_question in combined:
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))
    # the function uses a naive doc2vec extension of GloVe word vectors
    vec1 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(query)
        if word in word_vector_dict
    ], axis=0)
    vec2 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(question)
        if word in word_vector_dict
    ], axis=0)
    similarity_score = 1 - distance.cosine(vec1, vec2)
    # list.append returns None, so do not reassign its result
    store_question_score.append((query_question[1], similarity_score))
    count += 1
    if count == len(data_list):
        # list.sort also returns None; sorted() returns the sorted copy
        store_question_score_descending = sorted(
            store_question_score, key=itemgetter(1), reverse=True
        )
        result_dict[query_question[0]] = store_question_score_descending[:5]
        store_question_score = []
        count = 1
The above logic aims to calculate the similarity scores between questions as part of a text-similarity algorithm. I suspect the delay in the iteration comes from the calculation of vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up the process.
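One way to confirm where the time goes, before optimizing, is to time just the vector computation on a small slice of the data. A rough sketch (the 10,000-pair slice size is arbitrary):
import time

# Micro-benchmark: time only the tokenize + mean-vector step on
# 10,000 pairs, then extrapolate to the full 36 million.
start = time.perf_counter()
for query_question in combined[:10000]:
    for text in query_question:
        tokens = word_tokenize(text)
        _ = np.mean([word_vector_dict[w] for w in tokens
                     if w in word_vector_dict], axis=0)
elapsed = time.perf_counter() - start
print("{:.2f}s for 10k pairs, ~{:.0f}s extrapolated to 36M".format(
    elapsed, elapsed * 3600))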
There are plenty of other questions about iterating over huge lists, but I could not find any that solved my problem.
I really appreciate any help you can provide.
Recommended answer
Try caching:
from functools import lru_cache

@lru_cache(maxsize=None)
def compute_vector(s):
    # s is the string form of a token list, e.g. "['how', 'are', 'you']",
    # so repeated sentences hit the cache instead of being recomputed
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)
Then use this instead:
vec1 = compute_vector(query)
vec2 = compute_vector(question)
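As a quick sanity check, any function wrapped in lru_cache exposes hit/miss counters via cache_info(), which shows how much recomputation the cache is actually saving:
# After running the loop, inspect cache effectiveness.
print(compute_vector.cache_info())
# e.g. CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)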
If the size of the vectors is fixed, you can do even better by caching to a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:
class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        # Preallocate one row per expected key; each row stores one vector.
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        # First time this key is seen: assign it the next free row,
        # compute the vector, and store it there.
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item


def compute_vector(s):
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)


vector_cache = VectorCache(compute_vector, num_keys, item_size)
Then:
vec1 = vector_cache[query]
vec2 = vector_cache[question]
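Note that num_keys must cover every distinct string you will look up, and item_size must match the word-vector dimensionality. A small sketch of how these could be derived (word_vector_dict is the GloVe lookup from the question):
# Dimensionality of the word vectors, taken from any dictionary entry.
item_size = len(next(iter(word_vector_dict.values())))
# Upper bound on distinct cached strings (from the question: 370000 + 100).
num_keys = 370000 + 100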
Using a similar technique, you can also cache the cosine distances:
@lru_cache(maxsize=None)
def cosine_distance(query, question):
    return distance.cosine(vector_cache[query], vector_cache[question])
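Putting the pieces together, the original loop could then be reduced to something like the following sketch (it keeps the question's combined, data_list, and result_dict structure, and uses heapq.nlargest to take the top 5 without sorting the whole list):
import heapq
from operator import itemgetter

store_question_score = []
count = 1
result_dict = {}

for query_question in combined:
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))
    # Both vectors and the distance are cached after first use.
    similarity_score = 1 - cosine_distance(query, question)
    store_question_score.append((query_question[1], similarity_score))
    count += 1
    if count == len(data_list):
        # Keep only the top 5 scores for this query.
        result_dict[query_question[0]] = heapq.nlargest(
            5, store_question_score, key=itemgetter(1))
        store_question_score = []
        count = 1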