[SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Jun 20 19:02:37 PDT 2013
On 6/19/2013 4:38 PM, Sander Maijers wrote:
> On 19-06-13 01:44, Andreas Stolcke wrote:
>>> 2. In this case, what kind of smoothing goes on under the hood of P'?
>>> I have created my skip LM with the following parameters to
>>> 'ngram-count':
>>> -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s
>>> -limit-vocab -tolower
>>> does that also incorporate backoff and Good-Turing discounting like it
>>> would without '-skip'?
>> Yes, the underlying estimation algorithm (the M-step of the EM
>> algorithm) is a standard backoff ngram estimation.
>> The only thing that's nonstandard is that the ngram counts going into
>> the estimation are fractional counts, as computed in the E-step.
>> Therefore, the same limitations as triggered by the ngram-count
>> -float-counts option apply. Mainly, you can use only certain
>> discounting methods, those that can deal with fractional counts. In
>> particular, the methods based on counts-of-counts are out, so no GT or
>> KN discounting. You should get an error message if you try to use them.
>
> I did not specify a discounting method in the command line I gave, and
> if it can't be the default GT, then which discount method will be
> applied to the counts prior to the E step?
I had to review the code (written some 17 years ago) to remind myself
how the smoothing is handled with skip-ngrams ...
It looks like a short-cut is used: the discounting parameters are
estimated from the standard (integer) counts and then applied to the
fractional EM counts, without being recomputed at each iteration. This
means you can use any discounting method after all (in your case the
default Good-Turing), but of course the results are probably suboptimal.
It might be better to recompute discounts after each E-step, and you
would do that by modifying the SkipNgram::estimateMstep() function and
inserting calls to the discounts[]->estimate() function ahead of the
Ngram::estimate() call.
I also noticed there is a bug in ngram-count.cc that will keep things
from working when you read counts from a file rather than computing them
from text (i.e., if you're using ngram-count -read instead of
ngram-count -text). The problem is that, to estimate a skip-ngram of
order N, you need counts of order N+1. The attached patch fixes that,
but when you generate the counts in a separate step you still need to
make sure you extract them at order N+1.
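For example, with the attached patch applied, building an order-2 skip
LM from separately generated counts would look roughly like this (the
corpus, vocabulary, and count file names are just placeholders):

    # extract counts one order higher than the skip LM to be estimated
    ngram-count -order 3 -text corpus.txt -sort -write corpus.3grams

    # estimate the order-2 skip LM from the order-3 counts
    ngram-count -debug 1 -order 2 -skip -em-iters 3 -wbdiscount \
        -read corpus.3grams -vocab corpus.vocab -lm skiplm.2bo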
Below is a little script that you can put in
$SRILM/lm/test/tests/ngram-count-skip/run-test to exercise building and
testing a skip-bigram from trigram counts. This actually
doesn't produce lower perplexity than the regular bigram, but when I
apply the same method to 4gram counts (which are not distributed with
SRILM), the skip-trigram does have lower perplexity than the
corresponding standard trigram.
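To try the skip-trigram comparison yourself, you would generate the
4gram counts from your own data and change the order and counts
settings in the script, roughly along these lines (corpus.txt is a
placeholder for your training text):

    # 4gram counts are not part of the SRILM test data
    ngram-count -order 4 -text corpus.txt -write corpus.4grams.gz

    # then, in run-test:
    order=3
    counts=corpus.4grams.gz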
In any case, there are many possible variations on skip-ngrams, and the
SRILM implementation should be considered more an exercise to inspire
experimentation than a finished feature.
Andreas
------------------ ngram-count-skip/run-test -------------------------------
#!/bin/sh

# use the gzipped trigram counts from the ngram-count-gt test if available
dir=../ngram-count-gt

if [ -f $dir/swbd.3grams.gz ]; then
    gz=.gz
else
    gz=
fi

smooth="-wbdiscount -gt3min 1 -gt4min 1"
order=2
counts=$dir/swbd.3grams$gz

# create skip LM from counts of one order higher
ngram-count -debug 1 \
    -order $order \
    -skip -skip-init 0.0 \
    -em-iters 3 \
    $smooth \
    -read $counts \
    -vocab $dir/eval2001.vocab \
    -lm skiplm.${order}bo$gz

# compute test-set perplexity with the skip LM
ngram -debug 0 -order $order \
    -skip -lm skiplm.${order}bo$gz \
    -ppl $dir/eval97.text

# clean up
rm -f skiplm.${order}bo$gz
-------------- next part --------------
Index: lm/src/ngram-count.cc
===================================================================
RCS file: /home/srilm/CVS/srilm/lm/src/ngram-count.cc,v
retrieving revision 1.74
diff -c -r1.74 ngram-count.cc
*** lm/src/ngram-count.cc 1 Mar 2013 16:34:37 -0000 1.74
--- lm/src/ngram-count.cc 21 Jun 2013 01:29:23 -0000
***************
*** 434,453 ****
      if (readFile) {
          File file(readFile, "r");
          if (readWithMincounts) {
!             makeArray(Count, minCounts, order);
              /* construct min-counts array from -gtNmin options */
              unsigned i;
!             for (i = 0; i < order && i < maxorder; i ++) {
                  minCounts[i] = gtmin[i + 1];
              }
!             for ( ; i < order; i ++) {
                  minCounts[i] = gtmin[0];
              }
!             USE_STATS(readMinCounts(file, order, minCounts));
          } else {
!             USE_STATS(read(file, order, limitVocab));
          }
      }
--- 434,455 ----
      if (readFile) {
          File file(readFile, "r");
+         unsigned countOrder = USE_STATS(getorder());
+ 
          if (readWithMincounts) {
!             makeArray(Count, minCounts, countOrder);
              /* construct min-counts array from -gtNmin options */
              unsigned i;
!             for (i = 0; i < countOrder && i < maxorder; i ++) {
                  minCounts[i] = gtmin[i + 1];
              }
!             for ( ; i < countOrder; i ++) {
                  minCounts[i] = gtmin[0];
              }
!             USE_STATS(readMinCounts(file, countOrder, minCounts));
          } else {
!             USE_STATS(read(file, countOrder, limitVocab));
          }
      }