How to tokenize only certain words in Lucene
Question
I'm using Lucene for my project and I need a custom Analyzer.
The code is:
public class MyCommentAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
        Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
        TokenStream filter = new StandardFilter( Version.LUCENE_48, source );
        filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );
        return new TokenStreamComponents( source, filter );
    }
}
I've built it, but now I'm stuck. I need the filter to select only certain words. It's the opposite of using stop words: instead of removing the terms found in a word list, keep only the terms that are in the word list, like a prebuilt dictionary. So StopFilter doesn't serve the purpose, and none of the filters Lucene provides seems suitable. I think I need to write my own filter, but I don't know how.
Any suggestions?
Recommended answer
You're right to look to StopFilter for a starting point, so read the source!
Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).
Cut all that, and StopFilter boils down to:
public final class StopFilter extends FilteringTokenFilter {
    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}
FilteringTokenFilter is a pretty simple class to extend. The key is just the accept method. It's called for each term in turn: if it returns true, the term is passed on to the output stream. If it returns false, the term is discarded.
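To make the accept contract concrete, here is a minimal Lucene-free sketch of the loop a filtering filter runs. SketchFilter, its term field, and dropShortTokens are hypothetical names made up for illustration; the real FilteringTokenFilter works on attributes such as CharTermAttribute and also tracks position increments, which this model ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class AcceptContractSketch {

    /** Hypothetical stand-in for FilteringTokenFilter: subclasses only supply accept(). */
    public abstract static class SketchFilter {
        private final Iterator<String> input;
        protected String term; // plays the role of the term attribute

        protected SketchFilter(Iterator<String> input) {
            this.input = input;
        }

        /** Return true to pass the current term on, false to drop it. */
        protected abstract boolean accept();

        /** Advance to the next accepted term; false once the input is exhausted. */
        public final boolean incrementToken() {
            while (input.hasNext()) {
                term = input.next();
                if (accept()) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Example subclass: keeps only terms of at least minLength characters. */
    public static List<String> dropShortTokens(List<String> tokens, final int minLength) {
        SketchFilter filter = new SketchFilter(tokens.iterator()) {
            @Override
            protected boolean accept() {
                return term.length() >= minLength;
            }
        };
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(filter.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [custom, analyzer]
        System.out.println(dropShortTokens(Arrays.asList("a", "custom", "of", "analyzer"), 3));
    }
}
```

The subclass never touches the loop. It only answers yes or no for the current term, which is why StopFilter needs so little code once the set-building helpers are stripped away.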
So the only thing you really need to change in StopFilter is to delete a single character (the ! in accept), so that accept returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there, as well.
public final class KeepOnlyFilter extends FilteringTokenFilter {
    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}
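As a rough illustration of what that one-character change does, the following Lucene-free sketch runs the same token list through a stop-style predicate and a keep-style predicate. The class name, method names, and sample words are assumptions chosen for the example, and a plain java.util.Set stands in for CharArraySet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class KeepVersusStopSketch {

    /** Stop-style filtering: accept a token only if it is NOT in the set. */
    public static List<String> stop(List<String> tokens, Set<String> words) {
        Predicate<String> inSet = words::contains;
        return tokens.stream().filter(inSet.negate()).collect(Collectors.toList());
    }

    /** Keep-style filtering: the same test without the negation. */
    public static List<String> keepOnly(List<String> tokens, Set<String> words) {
        Predicate<String> inSet = words::contains;
        return tokens.stream().filter(inSet).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "is", "a", "search", "library");
        Set<String> words = new HashSet<>(Arrays.asList("lucene", "search"));

        System.out.println(stop(tokens, words));     // prints [is, a, library]
        System.out.println(keepOnly(tokens, words)); // prints [lucene, search]
    }
}
```

Matching in this sketch is case-sensitive, so "Lucene" would not be kept by a set containing "lucene". In a real analyzer chain you would likely lowercase tokens before the keep filter or build the set to ignore case. It may also be worth checking whether your Lucene version's analyzers-common module already includes KeepWordFilter (in org.apache.lucene.analysis.miscellaneous), which appears to cover the same need without a custom class.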