[SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Jun 20 19:02:37 PDT 2013
On 6/19/2013 4:38 PM, Sander Maijers wrote:
> On 19-06-13 01:44, Andreas Stolcke wrote:
>>> 2. In this case, what kind of smoothing goes on under the hood of P'?
>>> I have created my skip LM with the following parameters to
>>> 'ngram-count':
>>> -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s
>>> -limit-vocab -tolower
>>> does that also incorporate backoff and Good-Turing discounting like it
>>> would without '-skip'?
>> Yes, the underlying estimation algorithm (the M-step of the EM
>> algorithm) is a standard backoff ngram estimation.
>> The only thing that's nonstandard is that the ngram counts going into
>> the estimation are fractional counts, as computed in the E-step.
>> Therefore, the same limitations as triggered by the ngram-count
>> -float-counts option apply. Mainly, you can use only certain
>> discounting methods, those that can deal with fractional counts. In
>> particular, the methods based on counts-of-counts are out, so no GT or
>> KN discounting. You should get an error message if you try to use them.
>
> I did not specify a discounting method in the command line I gave, and
> if it can't be the default GT, then which discount method will be
> applied to the counts prior to the E step?
I had to review the code (written some 17 years ago) to remind myself
how the smoothing is handled with skip-ngrams ...
It looks like a short-cut is used: the discounting parameters are
estimated from the standard (integer) counts and then applied to the
fractional EM counts, without being recomputed at each iteration. This
means you can use any discounting method after all (in your case the
default Good-Turing), but of course the results are probably suboptimal.
It might be better to recompute discounts after each E-step, and you
would do that by modifying the SkipNgram::estimateMstep() function and
inserting calls to the discounts[]->estimate() function ahead of the
Ngram::estimate() call.
I also noticed there is a bug in ngram-count.cc that will keep things
from working when you read counts from a file rather than computing them
from text (i.e., if you're using ngram-count -read instead of
ngram-count -text). The problem is that, to estimate a skip-ngram of
order N, you need counts of order N+1. The attached patch fixes that,
but when you generate the counts in a separate step you still need to
make sure you extract them at order N+1.
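For example, with the attached patch applied, building an order-2 skip
LM from separately generated counts would look roughly like this (the
corpus, vocabulary, and count file names are just placeholders):

    # extract counts one order higher than the skip LM to be estimated
    ngram-count -order 3 -text corpus.txt -sort -write corpus.3grams

    # estimate the order-2 skip LM from the order-3 counts
    ngram-count -debug 1 -order 2 -skip -em-iters 3 -wbdiscount \
        -read corpus.3grams -vocab corpus.vocab -lm skiplm.2bo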
Below is a little script that you can put in
$SRILM/lm/test/tests/ngram-count-skip/run-test to exercise building and
testing a skip-bigram from trigram counts. This actually
doesn't produce lower perplexity than the regular bigram, but when I
apply the same method to 4gram counts (which are not distributed with
SRILM), the skip-trigram does have lower perplexity than the
corresponding standard trigram.
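To try the skip-trigram comparison yourself, you would generate the
4gram counts from your own data and change the order and counts
settings in the script, roughly along these lines (corpus.txt is a
placeholder for your training text):

    # 4gram counts are not part of the SRILM test data
    ngram-count -order 4 -text corpus.txt -write corpus.4grams.gz

    # then, in run-test:
    order=3
    counts=corpus.4grams.gz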
In any case, there are many possible variations on skip-ngrams, and the
SRILM implementation should be considered more an exercise to inspire
experimentation than a finished feature.
Andreas
------------------ ngram-count-skip/run-test -------------------------------
#!/bin/sh

# use the gzipped trigram counts from the ngram-count-gt test if available
dir=../ngram-count-gt

if [ -f $dir/swbd.3grams.gz ]; then
    gz=.gz
else
    gz=
fi

smooth="-wbdiscount -gt3min 1 -gt4min 1"
order=2
counts=$dir/swbd.3grams$gz

# create skip LM from counts of one order higher
ngram-count -debug 1 \
    -order $order \
    -skip -skip-init 0.0 \
    -em-iters 3 \
    $smooth \
    -read $counts \
    -vocab $dir/eval2001.vocab \
    -lm skiplm.${order}bo$gz

# compute test-set perplexity with the skip LM
ngram -debug 0 -order $order \
    -skip -lm skiplm.${order}bo$gz \
    -ppl $dir/eval97.text

# clean up
rm -f skiplm.${order}bo$gz
-------------- next part --------------
Index: lm/src/ngram-count.cc
===================================================================
RCS file: /home/srilm/CVS/srilm/lm/src/ngram-count.cc,v
retrieving revision 1.74
diff -c -r1.74 ngram-count.cc
*** lm/src/ngram-count.cc 1 Mar 2013 16:34:37 -0000 1.74
--- lm/src/ngram-count.cc 21 Jun 2013 01:29:23 -0000
***************
*** 434,453 ****
      if (readFile) {
          File file(readFile, "r");
          if (readWithMincounts) {
!             makeArray(Count, minCounts, order);
              /* construct min-counts array from -gtNmin options */
              unsigned i;
!             for (i = 0; i < order && i < maxorder; i ++) {
                  minCounts[i] = gtmin[i + 1];
              }
!             for ( ; i < order; i ++) {
                  minCounts[i] = gtmin[0];
              }
!             USE_STATS(readMinCounts(file, order, minCounts));
          } else {
!             USE_STATS(read(file, order, limitVocab));
          }
      }
--- 434,455 ----
      if (readFile) {
          File file(readFile, "r");
+         unsigned countOrder = USE_STATS(getorder());
+ 
          if (readWithMincounts) {
!             makeArray(Count, minCounts, countOrder);
              /* construct min-counts array from -gtNmin options */
              unsigned i;
!             for (i = 0; i < countOrder && i < maxorder; i ++) {
                  minCounts[i] = gtmin[i + 1];
              }
!             for ( ; i < countOrder; i ++) {
                  minCounts[i] = gtmin[0];
              }
!             USE_STATS(readMinCounts(file, countOrder, minCounts));
          } else {
!             USE_STATS(read(file, countOrder, limitVocab));
          }
      }