

Stemming algorithm that produces real words




I need to take a paragraph of text and extract a list of "tags" from it. Most of this is quite straightforward, but I need help stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way).



This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).


For my example (community / communities), Snowball stems to "communiti".



Are there any other stemming algorithms that will do this? Has anyone else solved this problem?




The core issue here is that stemming algorithms work purely from the language's spelling rules (mostly by stripping or rewriting suffixes), with no actual understanding of the language they're working with. To produce real words, you'll probably have to combine the stemmer's output with some form of lookup function that converts the stems back to real words. I can see two potential ways to do this:

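  1. Find or create a large dictionary that maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function that compares each stem against the list of words that were reduced to that stem and tries to determine which word is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" is recognized as the more similar option)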
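A minimal sketch of option 2, assuming Levenshtein edit distance as the similarity measure (that choice is mine; any string-similarity function could be swapped in). It is written in Python for brevity, and the names `edit_distance` and `closest_word` are made up for illustration. PHP has a built-in `levenshtein()` function, so a PHP version would not need the first function at all.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_word(stem, candidates):
    """Pick the candidate nearest to the stem; ties go to the
    shorter word, then alphabetical order (an arbitrary rule)."""
    return min(candidates, key=lambda w: (edit_distance(stem, w), len(w), w))


print(closest_word("communiti", ["communities", "community"]))  # community
```

Here "community" is one substitution away from "communiti", while "communities" needs two insertions, so "community" wins. Note that not every similarity measure agrees: a ratio based on the longest common subsequence would favour "communities", because it contains the whole stem.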

Personally, I think I would do it as a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, and then assume that the most common word for each stem is the one that should be used. (e.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will give results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
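The dynamic dictionary could be sketched roughly as follows, in Python for brevity. Everything named here is hypothetical: `StemDictionary` is an illustrative class, and `toy_stem` is a stand-in that only handles the -y / -ies case from the question. It is not Porter or Snowball, so in practice you would pass in your real stemmer instead. An in-memory counter also stands in for the database.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Stand-in stemmer (NOT Porter/Snowball): only maps -y and -ies to -i."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("y"):
        return word[:-1] + "i"
    return word


class StemDictionary:
    """Records every word seen under its stem and maps each stem
    back to the surface form seen most often."""

    def __init__(self, stem):
        self.stem = stem                    # any callable: word -> stem
        self.counts = defaultdict(Counter)  # stem -> Counter of words

    def record(self, word):
        word = word.lower()
        self.counts[self.stem(word)][word] += 1

    def tag_for(self, word):
        word = word.lower()
        forms = self.counts.get(self.stem(word))
        if not forms:
            return word  # stem never recorded: fall back to the word itself
        # Most frequent form; ties go to the shorter, then alphabetical word.
        return min(forms, key=lambda w: (-forms[w], len(w), w))


tags = StemDictionary(toy_stem)
for w in "Communities grow when communities share and a community listens".split():
    tags.record(w)

print(tags.tag_for("community"))  # communities
```

Because "communities" was recorded twice and "community" once, both words map to the tag "communities", which matches the example above. The tie-breaking rule is an arbitrary choice; you might prefer the similarity function from option 2 for ties.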


