[SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker

Mon Jun 24 10:41:29 PDT 2013

On 21-06-13 04:02, Andreas Stolcke wrote:
> On 6/19/2013 4:38 PM, Sander Maijers wrote:
>> On 19-06-13 01:44, Andreas Stolcke wrote:
>>>> 2. In this case, what kind of smoothing goes on under the hood of P'?
>>>> I have created my skip LM with the following parameters to
>>>> 'ngram-count':
>>>> -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s
>>>> -limit-vocab -tolower
>>>> does that also incorporate backoff and Good-Turing discounting like it
>>>> would without '-skip'?
>>> Yes, the underlying estimation algorithm (the M-step of the EM
>>> algorithm) is a standard backoff ngram estimation.
>>> The only thing that's nonstandard is that the ngram counts going into
>>> the estimation are fractional counts, as computed in the E-step.
>>> Therefore, the same limitations as triggered by the ngram-count
>>> -float-counts option apply.   Mainly, you can use only certain
>>> discounting methods, those that can deal with fractional counts.  In
>>> particular, the methods based on counts-of-counts are out, so no GT or
>>> KN discounting.  You should get an error message if you try to use them.
>>
>> I did not specify a discounting method in the command line I gave, and
>> if it can't be the default GT, then which discount method will be
>> applied to the counts prior to the E step?
>
> I had to review the code (written some 17 years ago) to remind myself
> how the smoothing is handled with skip-ngrams ...
>
> It looks like a short-cut is used:  the discounting parameters are
> estimated on the standard counts, and then applied to the fractional EM
> counts without recomputing them at each iteration.     This means you
> can use any method, but of course the results are probably suboptimal.
> It might be better to recompute discounts after each E-step, and you
> would do that by modifying the SkipNgram::estimateMstep() function and
> inserting calls to the discounts[]->estimate() function ahead of the
> Ngram::estimate() call.
>
> I also noticed there is a bug in ngram-count.cc that will keep things
> from working when you read counts from a file rather than computing them
> from text (i.e., if you're using ngram-count -read instead of
> ngram-count -text).   The problem is that, to estimate a skip-ngram of
> order N, you need counts of order N+1.  The attached patch will fix
> that, but you still need to make sure you extract the counts of order
> N+1 when you're doing that in a separate step.
>
> Below is a little script that you can stick in
> $SRILM/lm/test/tests/ngram-count-skip/run-test and then exercise
> building and testing a skip-bigram from trigram counts.  This actually
> doesn't produce lower perplexity than the regular bigram, but when I
> apply the same method to 4gram counts (which are not distributed with
> SRILM), the skip-trigram does have lower perplexity than the
> corresponding standard trigram.
>
> In any case, there are many possible variations on skip-ngrams and the
> SRILM implementation should be considered more as an exercise to inspire
> experimentation.
>
> Andreas

Thank you for your work! The default discounting method was used with 
both my baseline and the skip LM.

I was also looking at the code, however I've run my experiments already 
and need to wrap up quickly. After my thesis I may have a look again at 
the SRILM code.

Based on the equations you described to me and the code, I do not see 
the fundamental difference with skip N-gram model and Jelinek-Mercer 
smoothing / deleted interpolation (Chen & Goodman, 1999, eqn. 4 p. 364). 
In the skip LM the skip probabilities substitute the lambda weights in 
the Jelinek-Mercer equation, and are estimated in the perhaps special 
way you explained. Is there something I miss?

> ------------------ ngram-count-skip/run-test
> -------------------------------
> #!/bin/sh
>
> dir=../ngram-count-gt
>
> if [ -f $dir/swbd.3grams.gz ]; then
>          gz=.gz
> else
>          gz=
> fi
>
> smooth="-wbdiscount -gt3min 1 -gt4min 1"
>
> order=2
> counts=$dir/swbd.3grams$gz
>
> # create LM from counts
> ngram-count -debug 1 \
>          -order $order \
>          -skip -skip-init 0.0 \
>          -em-iters 3 \
>          $smooth \
>          -read $counts \
>          -vocab $dir/eval2001.vocab \
>          -lm skiplm.${order}bo$gz
>
> ngram -debug 0 -order $order \
>          -skip -lm skiplm.${order}bo$gz \
>          -ppl $dir/eval97.text
>
> rm -f skiplm.${order}bo$gz
>
>
>