[SRILM User List] Why there are "_meta_1" in LM?

Sun Dec 2 20:06:54 PST 2012

I checked the make-big-lm shell script and found that the "_meta_" tag is
supposed to be lowercase.
Line 56 of the make-big-lm script says:
metatag=__meta__   #lowercase so it works with ngram-count -tolower

In fact, when I used make-big-lm to train an LM without the write-binary-lm
option, there was no "__meta__1" in the final ARPA LM. So I guess the problem
is possibly related to the binary format.
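
To check whether such tokens ended up in a given ARPA file, one can scan its
1-gram section. Below is a minimal sketch in Python. It assumes the tags look
like _meta_1 or __META__2 (any number of underscores, either case); the
pattern and the function name find_meta_unigrams are my own illustration and
not part of SRILM.

```python
import re

# Assumption: count-of-count tags look like _meta_1, __meta__2, _META_3, ...
META_TAG = re.compile(r"^_+meta_+\d+$", re.IGNORECASE)

def find_meta_unigrams(arpa_lines):
    """Return words in the 1-grams section of an ARPA LM that look like meta tags."""
    found = []
    in_unigrams = False
    for line in arpa_lines:
        line = line.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if not in_unigrams or not line:
            continue
        fields = line.split()
        # ARPA unigram line: logprob word [backoff-weight]
        if len(fields) >= 2 and META_TAG.match(fields[1]):
            found.append(fields[1])
    return found
```

For example, with a hypothetical file name lm.arpa, calling
find_meta_unigrams(open("lm.arpa")) returns the list of suspicious 1-grams,
which is empty if the LM is clean.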

2012/12/2 Andreas Stolcke <stolcke at icsi.berkeley.edu>

> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
>> Hi, I trained LMs with the write-binary-lm option. However, when I
>> converted the LM from binary format into ARPA format, I found 4 extra
>> 1-grams in the ARPA LM, as follows:
>> -8.988857 _meta_1
>> -8.988857 _meta_2
>> -9.201852 _meta_3
>> -9.201852 _meta_4
>> In fact, these four words do not exist in my vocab. So where do they
>> come from? What should I do to remove them?
>> Thanks!
>>
>
> Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
> keep track of counts-of-counts required for smoothing.   They should never
> appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere in your
> processing, causing these special tokens to no longer be recognized.
>
> Andreas
>
>