ngram-count -read performance difference for different tokens

Andreas Stolcke stolcke at speech.sri.com
Sat Dec 20 20:11:25 PST 2008


In message <4ded78d60812201600k45933f98l2fc40bd7a7221cd2 at mail.gmail.com> you wrote:
> 
> Dear SRILM List Members,
> 
> I was experimenting with the "-use-server" option of ngram, and it appears to
> work for "-ppl" calculations on plain text, but I was getting different numbers
> when working with count files. With some debugging, I realized that this was
> because the server was receiving <unk> tokens from the client.
> 
> I made the following modification:
> 
> line 352, LM.cc, version 1.5.7:
>     //vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
>     vocab.addWords(words, wids, order + 1);
> 
> and I am able to get the same results with or without using a server.
> 
> I have not checked whether this will affect the "-cache-served-ngrams" behavior
> or whether it may have other impacts on the results.

Good catch.
Actually, the correct fix is:

*** LM.cc	2008/12/17 00:17:26	1.66
--- LM.cc	2008/12/21 04:09:50
***************
*** 631,637 ****
  	/* 
  	 * Map words to indices
  	 */
! 	vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
  
  	/*
  	 *  Update the counts
--- 631,641 ----
  	/* 
  	 * Map words to indices
  	 */
! 	if (addUnkWords()) {
! 	    vocab.addWords(words, wids, order + 1);
! 	} else {
! 	    vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
! 	}
  
  	/*
  	 *  Update the counts

(compare the code in LM::sentenceProb()).
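To make the difference concrete, here is a small self-contained sketch of the
two mapping strategies the patch switches between (a toy analogue, not the
actual Vocab class; the word list, indices, and flag value are made up for
illustration). A getIndices()-style lookup collapses any out-of-vocabulary
word to <unk>, which is what the server ended up seeing, while an
addWords()-style lookup assigns the word a fresh index so it keeps its
identity:

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    typedef unsigned VocabIndex;

    int main()
    {
        const VocabIndex unkId = 0;               // plays the role of vocab.unkIndex()
        std::map<std::string, VocabIndex> vocab;  // toy vocabulary: word -> index
        vocab["<unk>"] = unkId;
        vocab["the"] = 1;
        vocab["cat"] = 2;

        std::vector<std::string> words;
        words.push_back("the");
        words.push_back("zyzzyva");               // out-of-vocabulary word
        words.push_back("cat");

        bool addUnkWords = true;                  // plays the role of the addUnkWords() test

        for (size_t i = 0; i < words.size(); i++) {
            VocabIndex wid;
            std::map<std::string, VocabIndex>::iterator it = vocab.find(words[i]);
            if (it != vocab.end()) {
                wid = it->second;                 // known word: same result either way
            } else if (addUnkWords) {
                wid = vocab.size();               // addWords()-style: assign a new index
                vocab[words[i]] = wid;            // and remember the original word
            } else {
                wid = unkId;                      // getIndices()-style: collapse to <unk>
            }
            std::cout << words[i] << " -> " << wid << "\n";
        }
        return 0;
    }

With addUnkWords set to false, "zyzzyva" comes out as index 0, i.e. as <unk>,
which is what the client was forwarding to the server before the fix.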

Andreas 



