Add a property to emit single characters as tokens #1075

Open
gwisdomroof opened this issue Aug 28, 2024 · 1 comment

Comments

gwisdomroof commented Aug 28, 2024

Problem description

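When ik_smart/ik_max_word matches a word during tokenization, it does not split that word any further into single characters. For example, for "唐诗三百首" (Three Hundred Tang Poems), ik_max_word produces: ["唐诗三百首", "唐诗三百", "唐诗", "三百", "首"]
The problem is that if I search for just the single character "诗", this document is not matched.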
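
To make the miss concrete, here is a minimal Python sketch that treats the token list quoted above as the indexed terms of one document. The `matches` helper is hypothetical and only models exact term lookup; it is not how Elasticsearch or IK is implemented.

```python
from typing import List

# Tokens quoted in this issue for "唐诗三百首" under ik_max_word.
indexed_terms = ["唐诗三百首", "唐诗三百", "唐诗", "三百", "首"]


def matches(query_term: str, terms: List[str]) -> bool:
    """Hypothetical helper: a document matches only if the exact term was indexed."""
    return query_term in terms


print(matches("唐诗", indexed_terms))  # True: "唐诗" was emitted as a token
print(matches("诗", indexed_terms))    # False: "诗" was never emitted on its own
```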

Preferred solution

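  1. Add a splitWord2Char property to ik_smart/ik_max_word. When it is true, matched words are additionally split into single characters. The default is false, so existing behavior is unchanged.
    For ik_max_word, "唐诗三百首" would then be split into: ["唐诗三百首", "唐诗三百", "唐诗", "三百", "唐", "诗", "三", "百", "首"]

  2. Add a new ik_char analyzer that splits text into single characters. "唐诗三百首" would be split into: ["唐", "诗", "三", "百", "首"]
    The main purpose of this analyzer is to cover wide characters encoded as surrogate pairs, which ES itself cannot handle but the IK analyzer currently can, and IK also has good support for Chinese word segmentation.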
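
The two proposals can be sketched in Python as follows. This only illustrates the intended output and is not IK's implementation: `split_words_to_chars` and `ik_char_tokens` are hypothetical names, the input token list is the one quoted in this issue, and the ordering (multi-character words first, then single characters in text order) is an assumption chosen to reproduce the example above. Positions and offsets, which a real analyzer tracks, are left out.

```python
from typing import List


def ik_char_tokens(text: str) -> List[str]:
    """Hypothetical ik_char: one token per character.

    Python iterates a str by Unicode code point, so a character outside the
    Basic Multilingual Plane (a surrogate pair in UTF-16) stays one token.
    Skipping whitespace is an assumption; the issue does not specify it.
    """
    return [ch for ch in text if not ch.isspace()]


def split_words_to_chars(text: str, word_tokens: List[str]) -> List[str]:
    """Hypothetical splitWord2Char=true: keep the words, add every single character."""
    words = [t for t in word_tokens if len(t) > 1]
    return words + ik_char_tokens(text)


text = "唐诗三百首"
ik_max_word = ["唐诗三百首", "唐诗三百", "唐诗", "三百", "首"]  # from the issue

print(split_words_to_chars(text, ik_max_word))
print(ik_char_tokens(text))

rare = "\U00020BB7"  # a CJK character outside the BMP
print(len(rare.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
print(ik_char_tokens(rare + "野家"))        # still three tokens, one per character
```

With splitWord2Char left at false, the output would simply be the original `ik_max_word` list, which matches the default the proposal asks for.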

@gwisdomroof (Author)

#854
This PR provides an ik_max_word_char analyzer, which is effectively ik_max_word with splitWord2Char set to true.
