question about SRILM

Fri Mar 12 09:03:09 PST 2004

In message <4051DAA8.5080700 at irisa.fr> you wrote:
> Hi.
> I have one question about SRILM. I don't understand how the 
> log-probability of a unigram is computed.
> Isn't it log[P(w)] = log[c(w)] - log[|V|], where c(w) is the frequency 
> of the word w in the training set and |V| is the size of the vocabulary?
> And, if this formula is used, are the tokens <s> and </s> considered to 
> be part of the vocabulary or not (i.e. are they counted in |V|)?
> 
> Thank you for answering.
> Solen Quiniou.
> 

The formula for unigram probabilities (modulo smoothing) is 

	log[P(w)] = log[c(w)] - log[N]

where N is the number of word TOKENS in the training corpus (not the 
size of the vocabulary).

End-of-sentence tags (</s>) are included in the count N, since they are
among the events that are predicted by the LM, but the
beginning-of-sentence tag (<s>) is not.
You will notice that the log probability of <s> is set to -99 (a
stand-in for minus infinity).
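
As a rough illustration of the counting described above, here is a minimal
Python sketch. It is not SRILM code: the function name and the toy corpus
are made up for the example, smoothing is ignored entirely, and base-10
logs are used on the assumption that the values should be comparable to
those in an ARPA-format LM file.

```python
import math
from collections import Counter

def unigram_logprobs(sentences):
    """Unsmoothed unigram log10 probabilities (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # </s> is a predicted event, so it is counted once per sentence;
        # <s> is never predicted, so it is not counted.
        counts.update(sentence.split() + ["</s>"])
    total_tokens = sum(counts.values())  # N: word tokens, not vocabulary size
    logprobs = {
        word: math.log10(count) - math.log10(total_tokens)
        for word, count in counts.items()
    }
    logprobs["<s>"] = -99.0  # stand-in for minus infinity
    return logprobs

# Toy corpus: c(a) = 3, c(b) = 2, c(</s>) = 2, so N = 7.
logprobs = unigram_logprobs(["a b a", "b a"])
for word, lp in sorted(logprobs.items()):
    print(word, round(lp, 4))
```

With real training data the numbers in the LM file will differ from this
estimate, because discounting redistributes some probability mass.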

--Andreas 

PS. Please send your questions to "srilm-user at speech.sri.com" in the 
future.