ngram-count -read performance difference for different tokens
Andreas Stolcke
stolcke at speech.sri.com
Sat Dec 20 20:11:25 PST 2008
In message <4ded78d60812201600k45933f98l2fc40bd7a7221cd2 at mail.gmail.com> you wrote:
>
> Dear SRILM List Members,
>
> I was experimenting with the "-use-server" option of ngram; it appears to
> work for "-ppl" calculations from text, but I was getting different numbers
> when working with count files. With some debugging, I realized that this was
> due to the server receiving <unk> tokens from the client.
>
> I made the following modification:
>
> line 352, LM.cc, version 1.5.7:
> //vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
> vocab.addWords(words, wids, order + 1);
>
> and I am able to get the same results with or without using a server.
>
> I have not checked whether this will affect the "-cache-served-ngrams" policy or
> whether it may have other impacts on the results.
Good catch.
Actually, the correct fix is:
*** LM.cc	2008/12/17 00:17:26	1.66
--- LM.cc	2008/12/21 04:09:50
***************
*** 631,637 ****
      /*
       * Map words to indices
       */
!     vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
  
      /*
       * Update the counts
--- 631,641 ----
      /*
       * Map words to indices
       */
!     if (addUnkWords()) {
!         vocab.addWords(words, wids, order + 1);
!     } else {
!         vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
!     }
  
      /*
       * Update the counts
(compare the code in LM::sentenceProb()).
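The difference is that Vocab::getIndices() maps any word not already in the
vocabulary to the given unknown-word index, whereas Vocab::addWords() adds it
to the vocabulary, so the original token is preserved for the server-side
lookup; that is why the fix guards on addUnkWords(). A minimal sketch of the
distinction (the function name and word list below are just for illustration):

#include "Vocab.h"

// Illustration only: "mapWords" and the word list are hypothetical.
void mapWords(Vocab &vocab)
{
    VocabString words[] = { "known", "neverSeenBefore", 0 };
    VocabIndex wids[3];

    // getIndices(): words not in the vocabulary come back as vocab.unkIndex(),
    // which is why a closed-vocabulary client ends up sending <unk> tokens.
    vocab.getIndices(words, wids, 3, vocab.unkIndex());

    // addWords(): unknown words are added to the vocabulary and receive new
    // indices, so the original tokens survive for the server lookup.
    vocab.addWords(words, wids, 3);
}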
Andreas