Question about the practical usage of SRILM for Machine Translation

Thu Aug 3 05:06:37 PDT 2006

Hi everyone,

This question is about SRILM 1.4.1.

I am working on Statistical Machine Translation. The basic problem is to
find the best sentence e (English) given the input sentence f (foreign):
e* = argmax_e p(e|f) = argmax_e p(f|e) p(e)
Here p(f|e) is the translation model (including the lexicon and
alignment models).
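
To make the decomposition concrete, here is a minimal Python sketch of the argmax over a tiny candidate list. The candidate sentences and all probabilities are made up for illustration, and the helper name best_translation is hypothetical; a real decoder searches a much larger space and does not enumerate candidates like this.

```python
import math

def best_translation(candidates, tm_logprob, lm_logprob):
    """Return the candidate e that maximizes log p(f|e) + log p(e)."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])

# Toy, made-up scores for one fixed foreign sentence f (assumption).
candidates = ["the house is small", "small the house is"]

# log p(f|e): both word orders explain f equally well in this toy setup.
tm_logprob = {
    "the house is small": math.log(0.02),
    "small the house is": math.log(0.02),
}

# log p(e): the language model prefers the fluent English word order.
lm_logprob = {
    "the house is small": math.log(0.001),
    "small the house is": math.log(0.00001),
}

best = best_translation(candidates, tm_logprob, lm_logprob)
print(best)
```

With equal translation-model scores, the choice is decided entirely by p(e), which is why the quality of the language model matters here.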

What I am concerned with is p(e), the language model.

My corpus is EuroParl (European Parliament sessions). I am currently
working with 512,000 sentences (10,228,002 words), which yield 54182
unigrams, 1044600 bigrams, 765141 trigrams, ...
My questions are:

1. Which combination of the options currently available in ngram-count
should I use?
2. How many words per parameter should I use? (Joshua Goodman, in his
tutorial research.microsoft.com/~joshuago/lm-tutorial-v7-handouts.ps,
recommends that the ratio of number of words to number of parameters be
greater than 100 or 1000.)
3. Normally, an option -X stands for the corresponding options for each
n-gram order (e.g. -interpolate is like -interpolate1 -interpolate2 ...
-interpolateN), but why doesn't this work for -kndiscount?
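
For question 2, the ratio can be worked out from the corpus figures given above. This is only a rough sketch under two assumptions: each distinct n-gram is counted as one parameter (backoff weights are ignored), and only the orders listed in this message (1 to 3) are included, since the higher-order counts are not given.

```python
# Corpus figures quoted in this message.
words = 10_228_002
ngram_counts = {1: 54_182, 2: 1_044_600, 3: 765_141}

# Assumption: one parameter per distinct n-gram, orders 1-3 only.
parameters = sum(ngram_counts.values())
ratio = words / parameters

print(f"{parameters} parameters, {ratio:.2f} words per parameter")

# Cumulative ratio when the model is truncated at each order.
running_total = 0
for order in sorted(ngram_counts):
    running_total += ngram_counts[order]
    print(f"up to order {order}: {words / running_total:.2f} words per parameter")
```

Under these assumptions the trigram model already has only about 5.5 words per parameter, far below the 100 to 1000 range mentioned in question 2, and adding the higher orders can only lower the ratio further.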

So far, with this training text of 512,000 sentences and a test set of
2000 sentences (57951 words), the best combination I have found among
the LMs of order 7 is:
-order 7 -kndiscount 1 -kndiscount 2 -kndiscount 3 -kndiscount 4 
-kndiscount 5 -kndiscount 6 -kndiscount 7 -interpolate

(Related to the -kndiscount question: if I use -kndiscount alone, I get
the message: "warning: discount coeff 1 is out of range:
5.96382e-17")

Thanks for reading this long email, and thanks to anyone who takes the
time to answer.
Best,
Cuong