[SRILM User List] Count-lm reference request

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Oct 2 08:55:51 PDT 2013


On 10/2/2013 1:16 AM, E wrote:
> Thanks for the pointers! Three questions -
>
> 1. The same number of bins is used for all n-gram orders even though 
> the number of n-grams may differ for each N. In web1T,
> Number of unigrams:         13,588,391
> Number of fivegrams:     1,176,470,663
> Would it help to use more bins for fivegrams than for unigrams?
That's a good idea, but I haven't tried it, so I cannot say how much it 
would help.
It might also help to just have more bins for lower-order n-grams, since 
there are more samples of them (more data, hence more parameters can be 
estimated).
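
To make the suggestion concrete, here is a toy sketch (not SRILM code; 
the per-order rule below is an arbitrary assumption) of treating the 
number of count bins as a per-order parameter rather than a single 
global setting:

    # Toy sketch: make the number of count bins a per-order parameter.
    # The rule here (a few extra bins per decade of distinct n-grams) is
    # arbitrary; one could just as well give *lower* orders more bins,
    # as suggested above.
    import math

    distinct_ngrams = {1: 13_588_391, 5: 1_176_470_663}   # web1T figures quoted above

    def bins_for_order(num_distinct, base=5):
        return base + int(math.log10(num_distinct))

    for order, n in sorted(distinct_ngrams.items()):
        print("order %d: %d bins" % (order, bins_for_order(n)))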

>
> 2. For a particular n-gram in the test data, the algorithm decides 
> which bin's Wij values to use based on how many times that n-gram 
> occurred in the training data. Is this right?
Right.
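
To illustrate (a toy sketch, not SRILM's implementation; the log-spaced 
binning here is just an assumption for the example), the lookup amounts 
to mapping the n-gram's training count to a bin index and taking that 
bin's row of weights:

    # Toy sketch: pick the mixture-weight row for a test n-gram from the
    # bin its *training* count falls into.
    import math

    def count_to_bin(train_count, num_bins):
        # hypothetical log-spaced bins: count 0 -> bin 0, 1-9 -> bin 1, 10-99 -> bin 2, ...
        if train_count <= 0:
            return 0
        return min(1 + int(math.log10(train_count)), num_bins - 1)

    # mixweights[i][j-1] stands in for w_ij: count bin i, n-gram order j (as in w01, w03 below)
    mixweights = [[0.10, 0.20, 0.00],
                  [0.30, 0.40, 0.20],
                  [0.50, 0.60, 0.40]]

    train_count = 37   # how often this n-gram was seen in the training data
    row = mixweights[count_to_bin(train_count, len(mixweights))]
    print(row)         # the weight row used for this n-gram at test time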

>
> 3. What does it mean when some weights are zero after tuning? I 
> used just 10 sentences (5 repeated) in tune.txt and got the 
> google.countlm shown at the bottom.
>
> For example, w01 and w02 are non-zero but w03 is zero. Does this mean 
> that in the development set there were no trigrams that corresponded to 
> counts in bin 0?

Correct.
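
In other words, during tuning each development n-gram contributes 
evidence only to the weights of the (bin, order) combination its 
training count falls into, so a combination that no development n-gram 
hits ends up with zero weight. A toy tally (purely illustrative, with 
made-up counts and the same assumed binning as above):

    # Toy sketch: tally how many dev-set n-grams of each order land in each
    # count bin. Combinations with a zero tally get no evidence, so their
    # tuned weights remain zero.
    import math
    from collections import defaultdict

    def count_to_bin(train_count):
        # same hypothetical binning as in the sketch above
        return 0 if train_count <= 0 else 1 + int(math.log10(train_count))

    # (order, training count) pairs for the n-grams of a tiny dev set -- made-up numbers
    dev_ngrams = [(1, 0), (1, 12), (2, 0), (2, 3), (3, 450), (3, 7)]

    tally = defaultdict(int)
    for order, train_count in dev_ngrams:
        tally[(count_to_bin(train_count), order)] += 1

    print(dict(tally))   # any (bin, order) pair missing here keeps a zero weight;
                         # no (0, 3) entry would correspond to w03 = 0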

Andreas
