Question

I am running the naive Bayes classifier algorithm through Apache Mahout. We have the option to set the n-gram size when training and running an instance of the algorithm.

Changing my n-gram size from 1 to 2 changes the resulting classification drastically. Why does this happen? How does the n-gram size make such a drastic change in the result?


Solution

1-grams (unigrams) are single words; 2-grams (bigrams) are pairs of adjacent words. It's the difference between classifying documents based on the separate occurrences of "United" and "States" versus the single feature "United States". Bigrams enlarge the feature space, which has some space and performance costs, but because they capture word order they will often give better results than unigrams alone.
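To illustrate why the feature sets differ so much, here is a minimal sketch (plain Python, independent of Mahout's internals) that extracts unigrams and bigrams from the same text:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the United States of America".split()

unigrams = ngrams(tokens, 1)  # each word as its own feature
bigrams = ngrams(tokens, 2)   # each pair of adjacent words as one feature

print(unigrams)
# [('the',), ('United',), ('States',), ('of',), ('America',)]
print(bigrams)
# [('the', 'United'), ('United', 'States'), ('States', 'of'), ('of', 'America')]
```

With unigrams, "United" and "States" are scored as independent evidence; with bigrams, "United States" is a single feature with its own probability, so the classifier's per-class likelihoods (and hence its decisions) can shift substantially.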

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow