Get term frequencies in Lucene(在 Lucene 中获取词频)
问题描述
有没有一种快速简便的方法从 Lucene 索引中获取词频,而无需通过 TermVectorFrequencies
类来完成,因为对于大型集合来说这需要大量时间?
Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies
class, since that takes an awful lot of time for large collections?
我的意思是,有没有像 TermEnum
这样的东西,它不仅有文档频率,还有词频?
What I mean is, is there something like TermEnum
which has not just the document frequency but term frequency as well?
更新:使用 TermDocs 太慢了.
UPDATE: Using TermDocs is way too slow.
推荐答案
使用TermDocs
获取给定文档的词频.与文档频率一样,您可以使用感兴趣的术语从 IndexReader
获取术语文档.
您不会找到比 TermDocs
更快的方法而不失一些通用性.TermDocs
直接从索引段中的.frq"文件中读取,其中每个术语频率按文档顺序列出.
You won't find a faster method than TermDocs
without losing some generality. TermDocs
reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.
如果这太慢",请确保您已优化索引以将多个段合并为一个段.按顺序遍历文档(跳过没问题,但不能高效地在文档列表中来回跳转).
If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skips are alright, but you can't jump back and forth in the document list efficiently).
您的下一步可能是进行额外处理,以创建一个更专业的文件结构,省略 SkipData
.就我个人而言,我会寻找更好的算法来实现我的目标,或者提供更好的硬件——大量内存,或者保存 RAMDirectory
,或者提供给操作系统以在其自己的文件缓存系统上使用.
Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData
. Personally I would look for a better algorithm to achieve my objective, or provide better hardware—lots of memory, either to hold a RAMDirectory
, or to give to the OS for use on its own file-caching system.
这篇关于在 Lucene 中获取词频的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:在 Lucene 中获取词频


基础教程推荐
- 如何使用 Java 创建 X509 证书? 2022-01-01
- 在 Libgdx 中处理屏幕的正确方法 2022-01-01
- 减少 JVM 暂停时间 >1 秒使用 UseConcMarkSweepGC 2022-01-01
- “未找到匹配项"使用 matcher 的 group 方法时 2022-01-01
- Java:带有char数组的println给出乱码 2022-01-01
- 降序排序:Java Map 2022-01-01
- FirebaseListAdapter 不推送聊天应用程序的单个项目 - Firebase-Ui 3.1 2022-01-01
- Java Keytool 导入证书后出错,"keytool error: java.io.FileNotFoundException &拒绝访问" 2022-01-01
- 设置 bean 时出现 Nullpointerexception 2022-01-01
- 无法使用修饰符“public final"访问 java.util.Ha 2022-01-01