[SRILM User List] Why there are "_meta_1" in LM?

Andreas Stolcke stolcke at icsi.berkeley.edu
Sat Dec 1 09:08:50 PST 2012


On 12/1/2012 7:37 AM, Meng CHEN wrote:
> Hi, I trained LMs with the write-binary-lm option. However, when I converted the binary LM to ARPA format, I found four extra 1-grams in the ARPA LM, as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
> Thanks!

Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
keep track of counts-of-counts required for smoothing. They should
never appear in the LM.

I suspect you lowercased the strings in the counts file somewhere in 
your processing, causing these special tokens to no longer be recognized.
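
If the lowercasing is done by a script of your own, something along these
lines might help. This is only a rough sketch, not part of SRILM: it assumes
the special tokens always look like _META_<number> (as in the entries quoted
above) and that the file is plain whitespace-separated text. The function
names are hypothetical.

```python
import re

# Assumption: special tokens have the form _META_1, _META_2, ... as in this thread.
META_TOKEN = re.compile(r"^_META_\d+$")
LOWERCASED_META = re.compile(r"^_meta_\d+$")


def lowercase_preserving_meta(line: str) -> str:
    """Lowercase every token on the line except _META_<n> tokens.

    Splitting with a capturing group keeps the original whitespace
    (spaces, tabs, newline) so the line layout is unchanged.
    """
    parts = re.split(r"(\s+)", line)
    return "".join(p if META_TOKEN.match(p) else p.lower() for p in parts)


def find_lowercased_meta(lines):
    """Return tokens that look like accidentally lowercased _META_ tokens."""
    return [tok for line in lines for tok in line.split()
            if LOWERCASED_META.match(tok)]
```

Running find_lowercased_meta over the lines of an existing counts file or
ARPA LM is a quick way to check whether the tokens were already damaged;
if so, the counts probably need to be regenerated from the original data.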

Andreas


