ngram manipulation

Andreas Stolcke stolcke at speech.sri.com
Thu Mar 8 08:21:42 PST 2007


There is a hack to do it.
Remove from your LM all ngrams involving the <s> or </s> tokens (without
changing the other probabilities and backoff weights).
Then feed your ngrams to "ngram -debug 1 -ppl".  The "sentence"
log probabilities will now correspond to joint ngram probabilities,
since the initial word will back off to a unigram probability, and
the final </s> will count as an OOV and not contribute to the total
log probability.
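
For illustration, the pruning step could be scripted roughly as follows.
This is only a sketch under some assumptions: the function name
strip_sentence_boundaries is made up here (it is not part of SRILM), and
the input is taken to be a plain-text, uncompressed ARPA file with
whitespace-separated fields.  Besides dropping the ngrams, it rewrites the
"ngram N=" counts in the \data\ header so they match what is left.

```python
import re

BOUNDARY_TOKENS = {"<s>", "</s>"}
SECTION_RE = re.compile(r"^\\(\d+)-grams:\s*$")
COUNT_RE = re.compile(r"^ngram\s+(\d+)\s*=\s*\d+\s*$")


def strip_sentence_boundaries(arpa_lines):
    """Return the ARPA lines with every ngram containing <s> or </s> removed.

    Hypothetical helper, not an SRILM function.  Probabilities and backoff
    weights of the surviving ngrams are copied through unchanged.
    """
    kept = {}    # ngram order -> number of surviving ngrams
    items = []   # output lines; header counts are held as ints until the end
    order = 0    # 0 while outside any \N-grams: section

    for raw in arpa_lines:
        line = raw.rstrip("\n")
        section = SECTION_RE.match(line)
        header_count = COUNT_RE.match(line) if order == 0 else None
        if section:
            order = int(section.group(1))
            kept.setdefault(order, 0)
        elif line.strip() == "\\end\\":
            order = 0
        elif header_count:
            # Placeholder: the real count is only known after filtering.
            items.append(int(header_count.group(1)))
            continue
        elif order and line.strip():
            # Fields: logprob, then `order` words, then an optional backoff weight.
            words = line.split()[1:1 + order]
            if BOUNDARY_TOKENS.intersection(words):
                continue
            kept[order] += 1
        items.append(line)

    # Replace each header placeholder with the recomputed count.
    return [f"ngram {x}={kept.get(x, 0)}" if isinstance(x, int) else x
            for x in items]
```

The filtered lines can be written back out one per line, and the resulting
file could then be scored with something like
"ngram -lm stripped.lm -ppl ngrams.txt -debug 1", where the file names are
placeholders and ngrams.txt holds one ngram per line.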

It would be easy to add an option somewhere to make this more convenient,
without the need to hack the LM itself.

--Andreas

In message <45F01ED0.2030305 at idiap.ch>you wrote:
> Hello SRILM users,
> 
> I have a question on the use of srilm toolkit for LM manipulation.
> 
> The language model in the arpa format gives conditional probabilities
> e.g. p(wd3|wd1, wd2).
> Can I compute the joint probability p(wd1, wd2, wd3) using any utility?
> 
> I have a heavy LM with (ngram 1=50002, ngram 2=29077135, ngram 3=40083381).
> 
> 
> Any help would be greatly appreciated.
> Thanks,
> joel.
> 
> 
> arpa format:
> p(wd3|wd1,wd2) = if(trigram exists)             p_3(wd1,wd2,wd3)
>                  else if(bigram wd1,wd2 exists) bo_wt_2(wd1,wd2)*p(wd3|wd2)
>                  else                           p(wd3|wd2)
> 
> p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
>             else              bo_wt_1(wd1)*p_1(wd2)
> 
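
The backoff rules quoted above, combined with the chain rule
p(wd1,wd2,wd3) = p(wd1) * p(wd2|wd1) * p(wd3|wd1,wd2), can be sketched in
a few lines.  This is a toy illustration only: the table contents and the
names LOGPROB, BACKOFF, cond_logprob and joint_logprob are invented for
the example, and all values are log10 as in ARPA files, so products become
sums.

```python
# Toy model; every number here is made up for illustration.
LOGPROB = {
    ("a",): -0.5, ("b",): -0.7, ("c",): -1.0,
    ("a", "b"): -0.3, ("b", "c"): -0.4,
    ("a", "b", "c"): -0.2,
}
BACKOFF = {("a",): -0.1, ("b",): -0.2, ("a", "b"): -0.05}


def cond_logprob(word, context):
    """log10 p(word | context), following the ARPA backoff recursion."""
    context = tuple(context)
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        raise KeyError(f"OOV word: {word}")
    # Back off: add the context's backoff weight (0 if the context is not
    # listed) and retry with the context shortened by its oldest word.
    return BACKOFF.get(context, 0.0) + cond_logprob(word, context[1:])


def joint_logprob(words, order=3):
    """log10 of the joint probability of `words` via the chain rule.

    The first word gets a unigram probability and no </s> term is added,
    which is the quantity the <s>/</s>-stripped LM is meant to produce.
    """
    total = 0.0
    for i, word in enumerate(words):
        context = words[max(0, i - order + 1):i]
        total += cond_logprob(word, context)
    return total
```

For instance, joint_logprob(["a", "b", "c"]) adds the unigram entry for
"a", the bigram entry for "a b", and the trigram entry for "a b c" from
the toy table.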



