[SRILM User List] How does the option "-gtmin" work in ngram-count?
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Apr 10 16:41:15 PDT 2012
On 4/10/2012 12:21 AM, bulusheva wrote:
> Hi, I have two questions:
>
> 1. If I generate the language model with Kneser-Ney smoothing (or
> Modified Kneser-Ney), why does the parameter "-gtnmin" apply to the
> already modified counts?
>
> For example, suppose that in the training data the 2-gram "markov
> model" occurs only in the context "hidden markov model" and gt2min =
> 2. Then the modified count for "markov model" is n(* markov model) =
> 1 < gt2min, so
>
>     prob("markov model") = bow("markov") * prob("model")
>
> instead of
>
>     prob("markov model") = (n(* markov model) - D) / n(* markov *)
>
That's how it is currently implemented. It is debatable how the
minimum counts should be applied to the lower-order distributions in
KN models.
The way it currently works is natural from an implementation
perspective, because the lower-order counts are physically modified
before the discounting is applied (you can examine them by adding
-write COUNTS to the command).
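
For example, a command along these lines (the file names here are
made up) builds a KN-smoothed model and also dumps the counts that
the -gtXmin thresholds are compared against:

    ngram-count -text text.txt -order 3 -kndiscount -gt2min 2 \
        -write modified.counts -lm kn.lm

With -kndiscount in effect, the counts written out for the lower
orders are the modified (context) counts, not the raw n-gram
frequencies.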
But you are raising a good point. It might make more sense to have the
-gtXmin values be interpreted independently of the discounting method.
>
> 2. Let's say I use ngram-count to generate the language model as
> follows:
>
>     ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm sri.lm
>
> Suppose the word "hello" exists in "vocab.txt" and occurs 4 times in
> "text.txt". Then the probability of "hello" is calculated as the
> probability of a zeroton. Is that correct?
>
That is correct, but the ARPA format doesn't allow you to prune
unigrams, so unigrams will always appear explicitly listed in the
LM, even when their probabilities are obtained by backing off to a
uniform distribution.
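
For illustration (the numbers below are invented), the unigram
section of the resulting sri.lm would still contain an explicit line
for "hello", giving its log10 probability and backoff weight:

    \1-grams:
    ...
    -3.8821 hello   -0.3010
    ...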
Andreas