s-stemmer deviates from paper? #157
Comments
Hello @markharwood. That's entirely possible, because I think I wrote my implementation while reading Lucene's, which should be the same one ES is using. Do you, by chance, have a link to, or a PDF of, the original article? As stated here, I could only find a paper referencing the algorithm and explaining its broad intentions.
No, I only saw the same paper as you. I've just tried sending an email to the original paper author - I'm sure she'd like to see her algorithm implemented correctly too.
I heard back from Donna, the paper author. She agrees that words like bees/employees should fall through to rule 3, which removes the S. However, that logic would make rule 2 redundant. The origins of the S-stemmer algorithm appear to be lost in time: Donna didn't author it and suggested the logic may be connected to the SMART system from way back when. Rather than trying to resolve that, I've been working on an alternative plural stemmer for elasticsearch here.
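To make the two readings concrete, here is a minimal Python sketch. The rule table is an assumption: it is the three-rule S-stemmer as it is commonly quoted from the paper, not something spelled out in this thread, and the function name `s_stem` and its `fall_through` flag are hypothetical. With `fall_through=False`, a word that hits a rule's exception list is left unchanged, which matches the behaviour reported in this issue; with `fall_through=True`, it moves on to the next rule, which is the reading Donna agreed with.

```python
# Assumed rule set (suffix, exception suffixes, replacement), in priority order.
# This table is an assumption based on how the S-stemmer is usually quoted.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]


def s_stem(word: str, fall_through: bool = False) -> str:
    """Strip a plural suffix using the first applicable rule.

    fall_through=False: a word matching a rule's exception list is returned
    unchanged (no later rule is tried).
    fall_through=True: such a word is handed to the next rule instead.
    """
    for suffix, exceptions, replacement in RULES:
        if not word.endswith(suffix):
            continue
        if word.endswith(exceptions):
            if fall_through:
                continue  # exception only skips this rule
            return word  # exception ends processing
        return word[: -len(suffix)] + replacement
    return word


for w in ("bees", "employees", "tomatoes", "houses", "ponies", "glass"):
    print(w, s_stem(w), s_stem(w, fall_through=True))
```

Under the fall-through reading, rule 2 never produces a result that rule 3 would not also produce, which is the redundancy mentioned above. Note also that, with this assumed rule set, neither reading turns tomatoes into tomato: the fall-through variant yields tomatoe.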
Cool. Can you let me know when you feel your stemmer is done and merged into ES? I can then replicate it here if you want. Or feel free to open a PR if you'd like to do it yourself.
I see that `bees` doesn't stem to `bee` and `tomatoes` doesn't stem to `tomato`. Is this misinterpreting the logic in the original paper?
I ask because I work on Elasticsearch and discovered that we have a similar issue. See elastic/elasticsearch#42892 (comment) for my notes on the confusion.