Redlink version of the solr.HyphenationCompoundWordTokenFilterFactory with the fix for LUCENE-8183 and support for the epenthesis parameter, which configures the characters inserted between subwords in compound words.
The solr.HyphenationCompoundWordTokenFilterFactory does not support the ignoreCase parameter.
The typical workaround is to
- add the solr.LowerCaseFilterFactory before it, and
- convert the dictionary to lower case.
However, this workaround fails in cases where another TokenFilter changes the case of tokens. One example is the solr.StemmerOverrideFilterFactory when used with ignoreCase="true" and a case-sensitive dictionary. In such a setting the solr.LowerCaseFilterFactory has to be placed afterwards, as otherwise one would risk having mixed-case tokens in the token stream.
The ignoreCase="true" option solves this issue, as it allows this factory to work in a case-insensitive manner before the solr.LowerCaseFilterFactory.
If the current part is not in the dictionary (and a dictionary is present), the original solr.HyphenationCompoundWordTokenFilterFactory always removes the last character and makes an additional dictionary lookup.
This version instead checks whether the current part ends with a configured epenthesis. Only if this is the case is the epenthesis stripped from the part and an additional dictionary lookup made.
This allows fine-grained control over this functionality and prevents unexpected matches in the dictionary.
To keep backward compatibility, if no epenthesis is configured the old behavior of stripping the last character is kept.
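The lookup rule described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the class and method names are hypothetical, and trying the configured epentheses in list order until one yields a dictionary hit is an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper that illustrates the rule; not part of the module's API.
final class EpenthesisLookupSketch {

    /**
     * Looks up a subword candidate in the dictionary. If the part itself is not
     * found, a configured epenthesis is stripped from its end and the lookup is
     * repeated. If no epenthesis is configured, the last character is stripped
     * instead (the backward-compatible behavior).
     */
    static Optional<String> lookup(String part, Set<String> dictionary, List<String> epentheses) {
        if (dictionary.contains(part)) {
            return Optional.of(part);
        }
        if (epentheses.isEmpty()) {
            // Legacy fallback: always drop the last character and retry once.
            if (part.length() > 1) {
                String shortened = part.substring(0, part.length() - 1);
                if (dictionary.contains(shortened)) {
                    return Optional.of(shortened);
                }
            }
            return Optional.empty();
        }
        // Assumption: epentheses are tried in the configured order.
        for (String epenthesis : epentheses) {
            if (part.length() > epenthesis.length() && part.endsWith(epenthesis)) {
                String stripped = part.substring(0, part.length() - epenthesis.length());
                if (dictionary.contains(stripped)) {
                    return Optional.of(stripped);
                }
            }
        }
        return Optional.empty();
    }
}
```

With a dictionary containing 'ausbildung' and 'leiter' and the epentheses "es" and "s", the part 'ausbildungs' resolves to 'ausbildung', while 'leitern' yields no match. With an empty epenthesis list the legacy rule applies and 'leitern' resolves to 'leiter'.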
German
In German, epentheses are typically called 'Fugenlaute'. According to [1], about 27% of all compound words use a 'Fugenlaut'; about 15% of those use '-[e]s' and an additional 9% use '-[e]n'.
Based on this, one would set epenthesis="es,s,en,n". However, as en and n typically also form the plural, and such forms will therefore be in the dictionary, it is sufficient to set epenthesis="es,s".
To improve results it is important to ensure that words including the 'Fugenlaut' s are NOT in the dictionary, as otherwise they will be added as tokens. Stemmers will NOT remove the s at the end, and typical queries will NOT match those tokens.
To give an example: assuming the dictionary contains 'ausbildungs', 'ausbildung' and 'leiter', the word 'Ausbildungsleiter' will be decomposed into 'ausbildungs' and 'leiter'. If the stemmer does not remove the trailing 's', queries for 'Ausbildung' will not match the decomposed word.
A TokenFilter that decomposes compound words found in many Germanic languages to find the primary word. Inspired by this description of Primary Word Detection.
"Donaudampfschiff" is decomposed to Donau, dampf, schiff. The primary word is expected to be the last part - 'schiff' in this example.
If the configured dictionary contains 'dampfschiff', the primary word would be 'dampfschiff', as it is the longest match. With onlyLongestMatch=false, however, both 'dampfschiff' and 'schiff' would be added as tokens.
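The effect of onlyLongestMatch can be illustrated with a small sketch. It assumes that the hyphenator supplies the offsets at which a subword may start and that primary word candidates are the suffixes of the (already lowercased) token beginning at those offsets. The class and method names are hypothetical, and the real filter may select candidates differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of primary word selection; not the filter's code.
final class PrimaryWordSketch {

    /**
     * @param token            lowercased compound word
     * @param breakPoints      ascending offsets where a subword may start
     *                         (assumed to come from the hyphenation patterns)
     * @param dictionary       known words
     * @param onlyLongestMatch if true, stop after the first (longest) match
     * @return dictionary matches among the suffixes, longest first
     */
    static List<String> primaryWords(String token, int[] breakPoints,
                                     Set<String> dictionary, boolean onlyLongestMatch) {
        List<String> matches = new ArrayList<>();
        // Ascending offsets produce suffixes from longest to shortest.
        for (int start : breakPoints) {
            String suffix = token.substring(start);
            if (dictionary.contains(suffix)) {
                matches.add(suffix);
                if (onlyLongestMatch) {
                    break;
                }
            }
        }
        return matches;
    }
}
```

For 'donaudampfschiff' with break points 5 and 10 and a dictionary containing 'dampfschiff' and 'schiff', the sketch returns only 'dampfschiff' when onlyLongestMatch is true and both 'dampfschiff' and 'schiff' when it is false.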
While this filter works without a dictionary, it is highly recommended to provide one, as otherwise the result would be the last syllable of the token, which is not very helpful in most situations.
The factory accepts the following parameters:
- hyphenator (mandatory): path to the FOP XML hyphenation pattern. See offo.sourceforge.net/hyphenation/
- encoding (optional): encoding of the XML hyphenation file. Defaults to UTF-8.
- dictionary (optional, recommended): dictionary of words. Defaults to no dictionary.
- minWordSize (optional): minimum word length that gets decomposed. Defaults to 5.
- minSubwordSize (optional): minimum length of subwords. Defaults to 2.
- maxSubwordSize (optional): maximum length of subwords. Defaults to 15.
- onlyLongestMatch (optional): if true, only the longest matching word is added as primary word to the stream. Defaults to true.
<fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="io.redlink.lucene.analysis.compound.PrimaryWordTokenFilterFactory"
            hyphenator="hyphenator.xml" encoding="UTF-8" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
The ResourceCache allows memory-intensive resources to be shared between TokenFilters. This is especially useful for TokenFilters that use in-memory representations of dictionaries, such as the HunspellStemFilter, StemmerOverrideFilter or the CompoundWordFilter.
With typical Solr configurations, multiple instances with matching configuration are created. Typical examples are:
- index and query time Analyzers often use the same TokenFilter configuration
- different TextFields may use the same TokenFilter configuration
- all cores created for the same ConfigSet will instantiate the same TokenFilters (if shared schema is not enabled in the solr.xml)
- even different Cores and/or ConfigSets might use the same TokenFilter configurations
So in a setting with two German language fields, both using the same Hunspell stemmer configuration for index and query time analyzers, and 20 cores, the Hunspell dictionary would be loaded into memory 80 times (2 fields × 2 analyzers × 20 cores)!
The HunspellStemFilter is the best example, as it has the highest memory requirements. SOLR-3443 is related and describes exactly this problem.
The ResourceCache solves this problem by providing a framework that allows TokenFilter factories to get resources from a cache.
TokenFilterFactory implementations need to be adapted to make use of the ResourceCache. Because of that, this module includes adapted versions of the HunspellStemFilterFactory and the StemmerOverrideFilterFactory. In addition, all the other FilterFactory implementations provided by this module also support the ResourceCache.
The ResourceCache uses a singleton pattern. For the JVM this means one instance per Classloader. The Solr ResourceLoader builds a Solr Core specific Classloader for all resources from the Core/ConfigSet specific lib folder.
Because of this, if this module's jar is provided via the core's lib folder, every Core will have its own ResourceCache instance and resources will only be shared within a core. This limits the functionality.
With the above example the German Hunspell dictionary would be loaded 20 times (once per core) instead of the 80 times with no ResourceCache.
To share resources between multiple cores, one needs to provide the jar in the sharedLib folder (a configuration in the solr.xml).
- As Lucene does not define a lifecycle for TokenFilterFactory components, this cache cannot use RefCount. Instead, it uses WeakReference to hold Resources. So it is up to the garbage collector to decide when a Resource is no longer needed.
- The cache uses the Factory Pattern to avoid extending the ResourceLoader or adding a ResourceCacheAware callback. ResourceType definitions provide type safety and ResourceTypeLoader implementations perform the actual loading of resources (if not cached).
- The ResourceRef has two responsibilities: First, it is used as the key for the cache, so it defines equivalence for Resources. Second, it needs to provide all the information required by the ResourceTypeLoader to load the referenced resource.
- Currently, there is no way to enable/disable the ResourceCache. Using the Redlink versions of the TokenFilter factories will enable the usage of the ResourceCache.
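The idea behind the notes above, a cache keyed by a reference object that holds its values only weakly, can be sketched as follows. This is a generic illustration under stated assumptions and not the module's ResourceCache: the class name, the method signature and the use of a plain Function as loader are all hypothetical.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a weakly-referencing cache; not the module's API.
final class WeakResourceCacheSketch<K, V> {

    // Values are held via WeakReference, so the garbage collector may reclaim
    // a resource once no TokenFilter (or other caller) references it anymore.
    private final Map<K, WeakReference<V>> cache = new HashMap<>();

    /**
     * Returns the cached resource for the key, or loads and caches it.
     * The key plays the role of a resource reference: its equals/hashCode
     * define when two configurations denote the same resource, and it carries
     * what the loader needs to load the resource.
     */
    synchronized V get(K key, Function<? super K, ? extends V> loader) {
        WeakReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Not cached, or already collected: load and remember weakly.
            value = loader.apply(key);
            cache.put(key, new WeakReference<>(value));
        }
        return value;
    }
}
```

As long as one caller keeps a strong reference to the loaded value, every further call with an equal key returns the same instance without invoking the loader again. Once all callers have released the value, a later call simply reloads it.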