OOV words

Dmitriy Dligach Dmitriy.Dligach at colorado.edu
Tue Apr 7 08:30:53 PDT 2009


Hello,

First of all, I want to thank the creators of SRILM -- I find this
tool extremely useful in my research.

Second, I have a question about out-of-vocabulary (OOV) words. I train
a 5-gram language model on a collection of English newswire text:

ngram-count -text all.txt -lm all.lm -order 5

and then compute per-sentence probabilities on a test file:

ngram -lm all.lm -ppl test.txt -debug 1

My test.txt file happens to contain some sentences in foreign languages.
I'd expect them to receive very low probabilities because the
model was trained strictly on English text. Instead, they
receive very high probabilities.

Could this have something to do with the way SRILM handles OOV words?
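
To make my suspicion concrete, here is a minimal Python sketch of the effect I have in mind. It is not SRILM code. It assumes that OOV words are counted but left out of the log probability sum rather than penalized, and the toy unigram table and the score function are made up for illustration:

```python
# Hypothetical toy unigram model: log10 probabilities for a tiny English
# vocabulary. These numbers are invented for illustration only.
TOY_LOGPROBS = {"the": -1.0, "cat": -3.0, "sat": -3.2, "</s>": -1.2}


def score(tokens, logprobs):
    """Sum log10 probabilities over in-vocabulary tokens.

    Assumption: OOV tokens are counted separately and skipped,
    so they contribute nothing to the total.
    """
    total = 0.0
    oovs = 0
    for tok in tokens + ["</s>"]:
        if tok in logprobs:
            total += logprobs[tok]
        else:
            oovs += 1  # skipped, not penalized
    return total, oovs


english = "the cat sat".split()
foreign = "le chat dort".split()

eng_lp, eng_oov = score(english, TOY_LOGPROBS)
for_lp, for_oov = score(foreign, TOY_LOGPROBS)

print("english:", eng_lp, "OOVs:", eng_oov)
print("foreign:", for_lp, "OOVs:", for_oov)
```

Under that assumption, the foreign sentence ends up with the higher log probability simply because almost none of its words are scored, which looks like what I am seeing.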

Dima



