[SRILM User List] Why are there "_meta_1" entries in the LM?

Thu Dec 6 09:55:42 PST 2012

This happened because the binary LM file contains a record of the full 
vocabulary at the time the LM was created, not just the words that 
appear as unigrams (as in the ARPA format).  You must have run ngram 
-renorm or something similar later, which causes unigrams to be created 
for all words in the vocabulary.

Attached is a patch that prevents the _meta_ tokens from being included 
in that vocabulary.  Please check that it fixes your problem.
(You can also grab the beta version off the web site.)
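
To illustrate the idea behind the patch, here is a minimal standalone sketch. 
It is not SRILM code: ToyVocab, its members, and the prefix-based isMetaTag 
check are hypothetical stand-ins chosen for illustration (the real Vocab 
class decides what counts as a meta tag in its own way). It only shows how 
an extra flag can make the index-map writer skip meta-tag entries when an 
LM is being written.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a vocabulary: position = index, "" = unused slot.
struct ToyVocab {
    std::vector<std::string> byIndex;
    std::string metaTag;  // assumed meta-tag prefix, e.g. "__meta__"

    // Simplified assumption: a word is a meta tag if it starts with metaTag.
    bool isMetaTag(std::size_t i) const {
        return !metaTag.empty() &&
               byIndex[i].compare(0, metaTag.size(), metaTag) == 0;
    }

    // Build the "index string" lines in index order.
    // When writingLM is true, meta-tag words are left out.
    std::string writeIndexMap(bool writingLM) const {
        std::string out;
        for (std::size_t i = 0; i < byIndex.size(); i++) {
            if (!byIndex[i].empty() && !(writingLM && isMetaTag(i))) {
                out += std::to_string(i) + " " + byIndex[i] + "\n";
            }
        }
        return out;
    }
};
```

With a toy vocabulary holding "the", "__meta__1" and "cat", 
writeIndexMap(false) lists all three words, while writeIndexMap(true) lists 
only "the" and "cat", so the meta tag never reaches the LM file.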

Andreas

On 12/2/2012 8:06 PM, Meng Chen wrote:
> I have checked the make-big-lm shell script and found that the 
> "_meta_" tag should be lowercase.
> Line 56 of the make-big-lm script says:
> metatag=__meta__   #lowercase so it works with ngram-count -tolower
>
> In fact, when I used make-big-lm to train the LM without 
> write-binary-lm, there was no "__meta__1" in the final ARPA LM. So I 
> guess it is possibly related to the binary format.
>
>
> 2012/12/2 Andreas Stolcke <stolcke at icsi.berkeley.edu>
>
>     On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
>         Hi, I trained LMs with the write-binary-lm option. However,
>         when I converted the LM from binary format to ARPA format, I
>         found 4 extra 1-grams in the ARPA LM, as follows:
>         -8.988857 _meta_1
>         -8.988857 _meta_2
>         -9.201852 _meta_3
>         -9.201852 _meta_4
>         In fact, these four words do not exist in my vocab. So
>         where do they come from? What should I do to remove them?
>         Thanks!
>
>
>     Counts for _META_1 etc. (note the uppercase) are used by
>     ngram-count to keep track of counts-of-counts required for
>     smoothing.   They should never appear in the LM.
>
>     I suspect you lowercased the strings in the counts file somewhere
>     in your processing, causing these special tokens to no longer be
>     recognized.
>
>     Andreas
>
>

-------------- next part --------------
*** lm/src/NgramLM.cc.dist	2012-10-18 20:31:21.198065100 -0400
--- lm/src/NgramLM.cc	2012-12-05 18:08:22.701858000 -0500
***************
*** 875,881 ****
      /*
       * Vocabulary index
       */
!     vocab.writeIndexMap(file);

      long long offset = ftello(file);

--- 875,881 ----
      /*
       * Vocabulary index
       */
!     vocab.writeIndexMap(file, true);

      long long offset = ftello(file);

***************
*** 1051,1057 ****
  	fprintf(file, "data: %s\n", dataFile);  
      }

!     vocab.writeIndexMap(file);

      long long offset = ftello(dat);

--- 1051,1057 ----
  	fprintf(file, "data: %s\n", dataFile);  
      }

!     vocab.writeIndexMap(file, true);

      long long offset = ftello(dat);

*** lm/src/Vocab.cc.dist	2012-10-29 17:44:22.423039800 -0400
--- lm/src/Vocab.cc	2012-12-05 18:11:11.745755000 -0500
***************
*** 841,855 ****
   *	The format is ascii with one word per line:
   *		index	string
   *	The mapping is terminated by EOF or a line consisting only of ".".
   */
  void
! Vocab::writeIndexMap(File &file)
  {
      // Output index map in order of internal vocab indices.
      // This ensures that vocab strings are assigned indices in the same order
      // on reading, and ensures faster insertions into SArray-based tries.
      for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! 	if (byIndex[i]) {
  	    fprintf(file, "%u %s\n", i, byIndex[i]);
  	}
      }
--- 841,856 ----
   *	The format is ascii with one word per line:
   *		index	string
   *	The mapping is terminated by EOF or a line consisting only of ".".
+  *	If writingLM is true, omit words that should not appear in LMs.
   */
  void
! Vocab::writeIndexMap(File &file, Boolean writingLM)
  {
      // Output index map in order of internal vocab indices.
      // This ensures that vocab strings are assigned indices in the same order
      // on reading, and ensures faster insertions into SArray-based tries.
      for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! 	if (byIndex[i] && !(writingLM && isMetaTag(i))) {
  	    fprintf(file, "%u %s\n", i, byIndex[i]);
  	}
      }