question about SRILM
Andreas Stolcke
stolcke at speech.sri.com
Fri Mar 12 09:03:09 PST 2004
In message <4051DAA8.5080700 at irisa.fr> you wrote:
> Hi.
> I have one question about SRILM. I don't understand how is computed the
> log-probability of an unigram.
> Isn't it log[P(w)] = log[c(w)] - log[|V|], where c(w) is the frequency
> of the word w in the training set and |V| the size of the vocabulary ?
> And, if this formula is used, are the tokens <s> and </s> considered to
> be part of the vocabulary or not (i.e. are they counted in |V| ?) ?
>
> Thank you for answering.
> Solen Quiniou.
>
The formula for unigram probabilities (modulo smoothing) is
log[P(w)] = log[c(w)] - log[N]
where N is the total number of word TOKENS in the training corpus (not the
size of the vocabulary).
End-of-sentence tags (</s>) are included in the count, since they are among
the events that are predicted by the LM, but beginning-of-sentence tags (<s>)
are not. You will notice that the log probability of <s> is set to -99 (a
stand-in for minus infinity).
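
To make the counting concrete, here is a minimal Python sketch of the unsmoothed estimate described above. It is not SRILM code: the function name unigram_logprobs and the toy corpus are made up for illustration, and it assumes whitespace-separated tokens, one sentence per line, and base-10 logs (as used in ARPA-format LM files). With smoothing/discounting enabled, the numbers SRILM reports will differ.

```python
import math
from collections import Counter


def unigram_logprobs(sentences):
    """Unsmoothed (maximum-likelihood) unigram log10 probabilities.

    Hypothetical helper for illustration only; not part of SRILM.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
        # One </s> event per sentence is predicted by the LM, so it is counted.
        counts["</s>"] += 1
        # <s> is never predicted, so it is deliberately NOT counted.

    n_tokens = sum(counts.values())  # N: total tokens, including </s>
    logprobs = {
        word: math.log10(count) - math.log10(n_tokens)
        for word, count in counts.items()
    }
    # Stand-in for minus infinity, following the convention described above.
    logprobs["<s>"] = -99.0
    return logprobs, n_tokens


# Toy corpus (made up): counts are a=3, b=2, </s>=2, so N=7.
corpus = ["a b a", "b a"]
logprobs, n_tokens = unigram_logprobs(corpus)
for word in sorted(logprobs):
    print(word, round(logprobs[word], 4))
```

Note that N here is 7 (five words plus two </s> tokens), whereas the vocabulary size would be 3 or 4 depending on whether <s> is counted, which is why dividing by |V| does not give a proper distribution.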
--Andreas
PS. Please send your questions to "srilm-user at speech.sri.com" in the
future.