[SRILM User List] A problem with ngram-count with option "-text-has-weights"

tuzhaopeng tuzhaopeng at ict.ac.cn
Mon Mar 22 23:33:29 PDT 2010

Hi everyone,

I ran into a problem when training a language model with the option "-text-has-weights".

The input text with fractional counts is as follows:


1 china_H today
1 on_H
1 smuggling_H scale
0.000283545 less_H or
1 under_H
1 's_H last year
0.202422 more_H
0.000283545 more_H
1 less_H more or
1 brought_H
1 crackdown_H the
1.41754e-05 smuggling_H large - scale
1 of_H
0.105263 less_H more or
0.0021736 brought_H more
1.02756e-05 brought_H less
0.202422 been_H
1 been_H
0.105263 been_H
0.0021736 been_H


The fractional count and the sentence are separated by a space.

When I use KN discounting, it fails. The command is:
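
To make the format concrete, here is a minimal Python sketch of how I expect each line to be read: the first field is the weight, and every remaining token on the line receives that weight. This is only my reading of the input format, not SRILM code; the function name weighted_unigram_counts and the small sample are made up for illustration.

```python
from collections import defaultdict


def weighted_unigram_counts(lines):
    """Add each line's leading weight to every token on that line.

    Hypothetical helper: it mirrors my understanding of the
    weight-then-sentence layout, not SRILM's internal counting.
    """
    counts = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and lines with a weight but no words
        weight = float(fields[0])  # first field: fractional count
        for token in fields[1:]:   # remaining fields: the sentence
            counts[token] += weight
    return dict(counts)


# A few lines taken from the input shown above.
sample = [
    "1 china_H today",
    "0.202422 more_H",
    "0.000283545 more_H",
    "1 less_H more or",
]
counts = weighted_unigram_counts(sample)
for token, value in sorted(counts.items()):
    print(token, value)
```

With this reading, more_H accumulates the weights of both of its lines, while every token on a weight-1 line gets a count of 1.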

./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -kndiscount 

and the error message is:

error in discount estimator for order 1 

I then searched the Internet for more information and found that, with the option "-float-counts", only certain discounting methods support non-integer counts (wbdiscount and cdiscount). So I switched to Witten-Bell discounting with this command:

./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -wbdiscount -debug 3 

and the output is:

using WittenBell for 1-grams
using WittenBell for 2-grams
using WittenBell for 3-grams
warning: distributing 1 left-over probability mass over 2 zeroton words
writing 3 1-grams
writing 0 2-grams
writing 0 3-grams

Everything seems to go well; however, the LM file contains only:

ngram 1=3
ngram 2=0
ngram 3=0
-0.30103        </s>
-99     <s>
-0.30103        <unk>

So what is the problem? Is there something wrong with the input file or the command line?

Thanks and Regards

Tu Zhaopeng

 Institute of Computing Technology,
 Chinese Academy of Sciences