Cuong Huy To cuong at idiap.ch
Thu Aug 3 05:06:37 PDT 2006

```Hi everyone,

This question is about SRILM 1.4.1.

I am working on statistical machine translation. The basic problem is
to find the best English sentence e given the foreign input sentence f:
e = argmax_e p(e|f) = argmax_e p(f|e) * p(e)
Here p(f|e) is the translation model (including the lexicon and
alignment models).

What I am concerned about is p(e), the language model.

My corpus is EuroParl (European Parliament sessions). I am currently
working with 512,000 sentences and 10,228,002 words, which yield 54,182
unigrams, 1,044,600 bigrams, 765,141 trigrams, ...
My questions are:

1. Which combination of the options currently available in ngram-count
should I use?
2. How many words per parameter should I use? (Joshua Goodman, in his
tutorial research.microsoft.com/~joshuago/lm-tutorial-v7-handouts.ps,
recommends that the ratio of number of words to number of parameters be
greater than 100 or 1000.)
3. Normally, an option -X stands for the corresponding option at every
n-gram order (e.g. -interpolate is like -interpolate1 -interpolate2 ...
-interpolateN), but why doesn't this work for -kndiscount?

So far, with this training text of 512,000 sentences and a test set of
2000 sentences (57951 words), the best combination I have found among
the LMs with order 7 is:
-order 7 -kndiscount 1 -kndiscount 2 -kndiscount 3 -kndiscount 4
-kndiscount 5 -kndiscount 6 -kndiscount 7 -interpolate

(This relates to my question about -kndiscount: if I use -kndiscount
alone, I get the message "warning: discount coeff 1 is out of range:
5.96382e-17".)

Thanks for reading this long email, and thanks to all who might want to