[SRILM User List] Why there are "_meta_1" in LM?
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Dec 6 09:55:42 PST 2012
This happened because the binary LM file contains a record of the full
vocabulary at the time the LM was created, not just the words that
appear as unigrams (as in the ARPA format). You must have done ngram
-renorm or something similar later, which causes unigrams to be created
for all words in the vocabulary.
Attached is a patch that prevents the _meta_ tokens from being included
in that vocabulary. Check that it fixes your problem.
(You can also grab the beta version off the web site.)
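
For readers who do not want to read the full diff, here is a minimal, self-contained sketch of the idea behind the patch. It is not SRILM code: ToyVocab, metaTagPrefix and indexMapToString are hypothetical names made up for illustration, and the prefix test in isMetaTag is an assumption about how meta tags are recognized. Only the writingLM condition mirrors what the patch adds to Vocab::writeIndexMap.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for SRILM's Vocab class; not the real implementation.
struct ToyVocab {
    std::vector<std::string> byIndex;  // index -> word; empty string = unused slot
    std::string metaTagPrefix;         // assumed: meta tags are this prefix plus a number

    // Assumption: a word counts as a meta tag if it starts with the prefix.
    bool isMetaTag(std::size_t i) const {
        return !metaTagPrefix.empty() &&
               byIndex[i].compare(0, metaTagPrefix.size(), metaTagPrefix) == 0;
    }

    // Write one "index word" line per vocabulary entry.
    // When writingLM is true, meta tags are skipped so they never end up
    // in the vocabulary stored with the LM (same condition as in the patch).
    void writeIndexMap(std::ostream &out, bool writingLM = false) const {
        for (std::size_t i = 0; i < byIndex.size(); i++) {
            if (!byIndex[i].empty() && !(writingLM && isMetaTag(i))) {
                out << i << ' ' << byIndex[i] << '\n';
            }
        }
    }
};

// Convenience helper for the example: return the index map as a string.
std::string indexMapToString(const ToyVocab &vocab, bool writingLM) {
    std::ostringstream out;
    vocab.writeIndexMap(out, writingLM);
    return out.str();
}
```

With writingLM set to false the meta tags are still written, so other callers of the function keep their old behavior. Only the two LM-writing call sites pass true.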
Andreas
On 12/2/2012 8:06 PM, Meng Chen wrote:
> I have checked the make-big-lm shell script and found that the
> "_meta_" tag is supposed to be lowercase.
> Line 56 of the make-big-lm script says:
> metatag=__meta__ #lowercase so it works with ngram-count -tolower
>
> In fact, when I used make-big-lm to train the LM without the
> write-binary-lm option, there was no "__meta__1" in the final ARPA LM.
> So I guess the problem is related to the binary format.
>
>
> 2012/12/2 Andreas Stolcke <stolcke at icsi.berkeley.edu>
>
> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
> Hi, I trained LMs with the write-binary-lm option. However,
> when I converted the LM from binary format into ARPA format, I
> found 4 extra 1-grams in the ARPA LM, as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In fact, these four words do not exist in my vocab. So
> where do they come from? What should I do to remove them?
> Thanks!
>
>
> Counts for _META_1 etc. (note the uppercase) are used by
> ngram-count to keep track of counts-of-counts required for
> smoothing. They should never appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere
> in your processing, causing these special tokens to no longer be
> recognized.
>
> Andreas
>
>
-------------- next part --------------
*** lm/src/NgramLM.cc.dist 2012-10-18 20:31:21.198065100 -0400
--- lm/src/NgramLM.cc 2012-12-05 18:08:22.701858000 -0500
***************
*** 875,881 ****
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file);
long long offset = ftello(file);
--- 875,881 ----
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file, true);
long long offset = ftello(file);
***************
*** 1051,1057 ****
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file);
long long offset = ftello(dat);
--- 1051,1057 ----
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file, true);
long long offset = ftello(dat);
*** lm/src/Vocab.cc.dist 2012-10-29 17:44:22.423039800 -0400
--- lm/src/Vocab.cc 2012-12-05 18:11:11.745755000 -0500
***************
*** 841,855 ****
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
*/
void
! Vocab::writeIndexMap(File &file)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i]) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}
--- 841,856 ----
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
+ * If writingLM is true, omit words that should not appear in LMs.
*/
void
! Vocab::writeIndexMap(File &file, Boolean writingLM)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i] && !(writingLM && isMetaTag(i))) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}