[SRILM User List] Why there are "_meta_1" in LM?
Andreas Stolcke
stolcke at icsi.berkeley.edu
Sat Dec 1 09:08:50 PST 2012
On 12/1/2012 7:37 AM, Meng CHEN wrote:
> Hi, I trained LMs with the write-binary-lm option. However, when I converted the LM from binary format into ARPA format, I found 4 extra 1-grams in the ARPA LM, as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
> Thanks!
Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
keep track of counts-of-counts required for smoothing. They should
never appear in the LM.
I suspect you lowercased the strings in the counts file somewhere in
your processing, causing these special tokens to no longer be recognized.
Andreas
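
One way to check for the lowercasing problem described above is to scan the counts file for tokens that look like _META_<n> but are not all uppercase. The following is only a minimal sketch, not part of SRILM: it assumes a plain-text counts file with whitespace-separated tokens, it assumes the special tokens have the form _META_ followed by digits (as in this thread), and the function name find_mangled_meta_tokens is hypothetical.

```python
import re

# Assumption: the special tokens look like _META_<digits>, as in this thread.
# The pattern matches any letter case so that mangled variants are caught.
META_ANY_CASE = re.compile(r"_meta_\d+", re.IGNORECASE)


def find_mangled_meta_tokens(lines):
    """Return (line_number, token) pairs for meta tokens that are not all uppercase."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for token in line.split():
            # Flag tokens such as _meta_1 or _Meta_1; leave _META_1 alone.
            if META_ANY_CASE.fullmatch(token) and token != token.upper():
                hits.append((lineno, token))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a counts file.
    sample = ["_meta_1\t1234", "the cat\t7", "_META_2\t99"]
    for lineno, token in find_mangled_meta_tokens(sample):
        print(f"line {lineno}: {token}")
```

To check a real file, pass an open file object instead of the sample list, e.g. find_mangled_meta_tokens(open("my.counts")), where my.counts is a placeholder name. If anything is reported, the fix is most likely in the preprocessing step that changed the case, not in the LM itself.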