Solr Snowball stemmer is inconsistent with Spanish
28-10-2019
Question
I have this stemmed field:
<fieldtype name="textes" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-es.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" protected="protwords-es.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" protected="protwords-es.txt"/>
  </analyzer>
</fieldtype>
The expected result of the search query alquileres (rents) would be a match of alquiler (rent). But when I go to "Field Analysis" in the Solr Admin site and check an index value of alquiler and a query value of alquileres, the following happens:
- When indexing alquiler, it gets stemmed into alquil.
- When querying alquileres, it gets stemmed into alquiler.
So the simple case of searching for the plural form of a word (alquileres) would not match its singular form (alquiler).
Shouldn't both the index and the query values be stemmed into the same stem (either alquiler or alquil)? Is this a limitation of the algorithm or a misunderstanding/misconfiguration on my part?
Solution
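Snowball stemming is quite limited: it is a purely algorithmic, suffix-stripping stemmer, so two forms of the same word are not guaranteed to reduce to the same stem, as your alquiler/alquileres example shows. You'd get better results by using a dictionary-based stemmer (Hunspell): http://wiki.apache.org/solr/Hunspell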
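As a rough sketch of what a Hunspell-based field type could look like, assuming the Spanish dictionary files are called es_ES.dic and es_ES.aff and sit in the core's conf directory (both file names and the field type name textes_hunspell are placeholders, not taken from the question):

```xml
<!-- Sketch only: textes_hunspell, es_ES.dic and es_ES.aff are assumed names.
     Substitute the Spanish Hunspell dictionary files you actually install. -->
<fieldtype name="textes_hunspell" class="solr.TextField">
  <!-- A single analyzer without a type attribute is used for both index and query -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-es.txt"/>
    <!-- Dictionary-based stemming in place of SnowballPorterFilterFactory -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
  </analyzer>
</fieldtype>
```

Whether alquiler and alquileres end up with the same stem depends on the dictionary you use, so it is worth checking both values in "Field Analysis" again. Existing documents need to be reindexed after the analysis chain changes.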
Other tips
This link works properly for alquileres.
I use Hunspell from OpenOffice and it does an excellent job.
My example:
URL-Elastic/_analyze?analyzer=es_AR&text=alquileres
It returns:
{
  "tokens": [
    {
      "token": "alquiler",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}