[SRILM User List] Count-lm reference request
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Oct 2 08:55:51 PDT 2013
On 10/2/2013 1:16 AM, E wrote:
> Thanks for the pointers! Three questions -
>
> 1. The same number of bins is used for all n-grams even though the
> number of n-grams differs for each N. In Web 1T,
> Number of unigrams: 13,588,391
> Number of fivegrams: 1,176,470,663
> Would it improve anything if fivegrams were binned into more bins than
> unigrams?
That's a good idea, but I haven't tried it, so I cannot say how much it
would help.
It might also help to just have more bins for lower-order ngrams since
there are more samples of them (more data, hence more parameters can be
estimated).
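
To make the per-order binning idea concrete, here is a minimal Python
sketch (not SRILM code). The bin counts in BINS_PER_ORDER are
hypothetical, chosen so that lower orders get more bins, and the layout
(one weight per order in each bin) is a simplification for illustration:

    # Hedged illustration: a separate number of count bins per n-gram
    # order, with more bins for the lower orders, where there is more
    # data to estimate the weights.  All numbers are made up.
    BINS_PER_ORDER = {1: 16, 2: 16, 3: 8, 4: 4, 5: 4}

    # One interpolation-weight vector per (order, bin) cell,
    # initialized uniformly over the orders being interpolated.
    weight_tables = {
        order: [[1.0 / order] * order for _ in range(nbins)]
        for order, nbins in BINS_PER_ORDER.items()
    }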
>
> 2. For a particular n-gram in the test data, the algorithm decides
> which bin's weights Wij to use based on how many times that n-gram
> occurred in the training data. Is this right?
Right.
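
As an illustration of that lookup, here is a hedged Python sketch; the
linear quantization rule and the count_modulus and num_bins values are
assumptions for the example, not necessarily SRILM's exact binning
scheme:

    # Map an n-gram's training-data count to the index of the weight
    # bin whose w_ij values will be used.  Illustrative only.
    def count_to_bin(count, count_modulus=40, num_bins=5):
        return min(count // count_modulus, num_bins - 1)

    # e.g. an n-gram seen 130 times in training falls into bin 3 here:
    print(count_to_bin(130))   # -> 3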
>
> 3. What does it mean when some weights are zero after tuning? I used
> just 10 sentences (5 repeated) in tune.txt and got the google.countlm
> shown at the bottom.
>
> For example, w01 and w02 are non-zero but w03 is zero. Does this mean
> that in the development set there were no trigrams that corresponded
> to counts in bin 0?
Correct.
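
A short sketch of why that gives an exact zero: in an EM-style
reestimation of mixture weights (the general scheme such tuning
follows; the code below is illustrative, not SRILM's implementation),
each weight is renormalized from posterior counts accumulated only over
tuning n-grams that fall into its bin, so a cell that collects no mass
stays at zero:

    def reestimate_bin_weights(posterior_counts):
        # posterior_counts[j-1] = expected count attributed to the
        # order-j component for tuning events landing in this bin.
        total = sum(posterior_counts)
        if total == 0:
            return [0.0] * len(posterior_counts)  # bin unused in tuning
        return [c / total for c in posterior_counts]

    # With only 10 tuning sentences, no trigram mass may land in bin 0,
    # so the third weight (w03) comes out 0 while w01 and w02 do not:
    print(reestimate_bin_weights([0.7, 0.3, 0.0]))   # [0.7, 0.3, 0.0]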
Andreas