[SRILM User List] class based model
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Jan 6 11:31:42 PST 2014
On 1/6/2014 7:45 AM, DUGAST Loic wrote:
> Hi
>
> In the FAQ
> (http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html)
>
> You advise to ...
>
> c)
> Lower the minimum counts for N-grams included in the LM, i.e., the
> values of the options *-gt2min*, *-gt3min*, *-gt4min*, etc. The
> higher order N-grams typically get higher minimum counts.
>
>
> Do you not mean: *raise* the minimum counts (...) instead?
>
You are correct. It should say raise the min counts. We'll fix the
documentation ASAP.
> Also, I am not sure I understand why gt2min should be set higher than
> gt1min, etc.
> Higher-order ngrams are naturally less frequent. Therefore the same
> cutoff value (gt2min equal to gt1min) will be harsher on bigrams than
> on unigrams... Can you explain?
The minimum counts are a crude way to trade off performance for space,
and since there are a lot more long ngrams than short ngrams, you get more
space savings by cutting higher-order ngrams. It is typically not worth it
to eliminate unigrams and bigrams, but it is a decent tradeoff to remove
singleton trigrams and fourgrams. The default values were chosen based
on historical practice (I think they might have even been inherited from
the CMU LM toolkit).
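
To make the space argument concrete, here is a small Python sketch (not SRILM code) that counts distinct ngrams of each order in a toy sentence and shows how many would survive a minimum count of 2. The toy text and the helper names ngram_counts and kept are made up for illustration; the cutoff is assumed to keep an ngram when its count is at least the minimum.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every ngram of order n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def kept(counts, min_count):
    """Number of distinct ngrams whose count is at least min_count."""
    return sum(1 for c in counts.values() if c >= min_count)

# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat and the cat ate the rat on the mat".split()

# summary[n] = (distinct ngrams of order n, how many survive a min count of 2)
summary = {}
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    summary[n] = (len(counts), kept(counts, 2))
    print(f"order {n}: {summary[n][0]} distinct, {summary[n][1]} kept with min count 2")
```

Even on this tiny input the number of distinct ngrams grows with the order, and nearly all of the trigrams are singletons, so a cutoff at the higher orders removes the most entries.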
The better and more principled way to remove ngrams is entropy-based
pruning (the -prune option of ngram and ngram-count). So the best strategy
given limited memory is to make the gtmin values as low as you can afford
while still fitting into memory, then use -prune (you can do this in the
same invocation of ngram-count or make-big-lm).
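
As a rough sketch of that strategy, the Python snippet below only assembles an ngram-count command line that combines low minimum counts with -prune in one invocation; it does not run anything. The file names train.txt and pruned.lm, the order, and the pruning threshold 1e-8 are placeholders chosen for illustration, not recommended values, and the helper ngram_count_command is hypothetical.

```python
import shlex

def ngram_count_command(text, lm, order, gtmin=None, prune=None):
    """Build an ngram-count argument list (hypothetical helper, nothing is executed)."""
    args = ["ngram-count", "-order", str(order), "-text", text, "-lm", lm]
    # gtmin maps an ngram order to its minimum count, e.g. {3: 1} -> -gt3min 1
    for n, cutoff in sorted((gtmin or {}).items()):
        args += [f"-gt{n}min", str(cutoff)]
    # Entropy-based pruning threshold, applied in the same invocation.
    if prune is not None:
        args += ["-prune", str(prune)]
    return args

# Placeholder inputs: keep singleton trigrams and fourgrams, then prune.
cmd = ngram_count_command("train.txt", "pruned.lm", order=4,
                          gtmin={3: 1, 4: 1}, prune=1e-8)
print(shlex.join(cmd))
```

The threshold is something you would tune against the memory you have; a larger value removes more ngrams.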
Andreas