some tech queries
Andreas Stolcke
stolcke at speech.sri.com
Wed May 1 09:53:35 PDT 2002
In message <Pine.LNX.4.20_heb2.08.0204080255320.20347-100000 at pogo.cs.huji.ac.il>
you wrote:
>
> Hi,
>
> I am new to SRILM, and quite new to language modelling at large
> (coming from other domains of n-gram models usage).
>
> I have run some preliminary probes with SRILM (on linux, smooth install)
> and have the following questions:
Sorry for taking a while, but I hope the answers are still useful.
>
> 1. in ngram-count:
> when using -lm with the default -order 3, i had expected -text <textfile>
> to yield the same model as -read <order1> -read <order2> -read <order3>
> where order{1-3} have been obtained through ngram-count -write{1-3}
> (all other parameters being equal). and yet the two LM files differ.
> how come?
You are right, the two methods of getting the counts should be equivalent.
You can test this by doing
    ngram-count -text TEST -write NEWCOUNTS1
and
    ngram-count -read COUNTS -write NEWCOUNTS2
and comparing the two output files. If you find a discrepancy then there might
be a bug and I'd like you to send me a small test case that shows the problem.
BTW, there is no reason to use -write1 -write2 -write3 together if you
are going to combine the counts later. Just -write will do the job.
>
> 2. in ngram-count:
> i'm not quite clear about the multiple -cdiscount flags.
> suppose i want a default -order 3 LM.
> mustn't i give all three D's and have the model interpolate over all
> of these, as eq. (18) in Chen&Goodman (p.15) implies?
> in practice it seems one can specify any subset of the 3 and get
> different models. (are there default Ds?)
The way it is implemented you have complete freedom to use
different discounting methods for different orders of N-grams.
The default is Good-Turing, so
-cdiscount1 D1 -cdiscount3 D3
would use absolute discounting for orders 1 and 3, but GT for bigrams.
(There is no default D value for absolute discounting).
Also, whether or not higher-order estimates use interpolation with
lower-order estimates can be chosen separately for each order.
Not all possible combinations make sense from a theoretical point of
view, so it's up to you to not abuse this flexibility.
> 3. in ngram-count:
> probably closely related to question 2.
> (and prob. due to some confusion i have between backoff & interpolation)
> why are there multiple -interpolate flags.
> again, eq. (18) in C&G appears to imply a recursive all levels
> interpolation. and yet ngram-count appears to take any subset of
> -interpolate{1-3} (in the above example) and yield different LMs.
See above. -interpolate<N> estimates order-N N-gram probabilities by
interpolating with order-(N-1) estimates. The latter could themselves
be interpolated or not, so you control how far the recursion goes.
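As a rough illustration of that recursion (not SRILM's actual code), the sketch below implements fully interpolated absolute discounting on toy counts, i.e. the case where every order interpolates with the next lower one. The function names, the toy text, the discount values, and the choice to interpolate unigrams with a uniform distribution are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(tokens, order):
    """Count all n-grams of length 1..order in a token list."""
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def interp_prob(word, history, counts, vocab, discounts):
    """Interpolated absolute-discounting estimate of P(word | history).

    discounts maps n-gram order -> D (assumed 0 < D <= 1).
    """
    order = len(history) + 1
    # Lower-order estimate: recurse on the shortened history; below the
    # unigram level use a uniform distribution (an assumption).
    if order == 1:
        lower = 1.0 / len(vocab)
    else:
        lower = interp_prob(word, history[1:], counts, vocab, discounts)

    # Words observed after this history, with their counts.
    followers = {w: counts[history + (w,)] for w in vocab
                 if counts[history + (w,)] > 0}
    total = sum(followers.values())
    if total == 0:
        return lower  # unseen history: rely on the lower order entirely

    d = discounts[order]
    discounted = max(followers.get(word, 0) - d, 0.0) / total
    # Mass removed by discounting, redistributed via the lower-order model.
    freed_mass = d * len(followers) / total
    return discounted + freed_mass * lower


tokens = "a b a c a b".split()
vocab = sorted(set(tokens))
counts = ngram_counts(tokens, 3)
discounts = {1: 0.5, 2: 0.5, 3: 0.5}  # hypothetical D values
total_prob = sum(interp_prob(w, ("a", "b"), counts, vocab, discounts)
                 for w in vocab)
print(total_prob)
```

Summing the estimates over the vocabulary is a quick sanity check that the discounted mass is handed to the lower order and not lost. Turning interpolation off at some order would replace the recursive call at that level with a backoff scheme, which this sketch does not cover.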
> 4. combining 2+3:
> if i want an absolute discount model of order, say 3,
> "by the book" C&G eq. (18), what is the proper way to run it?
> assume i have run ngram-count => get-gt-counts => make-abs-discount
> and obtained <D1> <D2> <D3>.
> a command line example will be highly appreciated.
Correct.
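For concreteness, here is a small sketch that assembles one possible command line from the flags discussed in this thread (-order, -text, -lm, -cdiscount<N>, -interpolate<N>). The helper name is hypothetical, TEXT, LM, and D1..D3 are placeholders for your own files and discount values, and turning on interpolation at every order is one reading of "all levels" in the question; this thread does not confirm it as the exact recipe.

```python
import shlex


def abs_discount_command(text, lm, discounts):
    """Build an ngram-count argument list with one discount per order."""
    args = ["ngram-count", "-order", str(len(discounts)), "-text", text]
    for n, d in enumerate(discounts, start=1):
        # Absolute discounting and interpolation, both set per order.
        args += [f"-cdiscount{n}", str(d), f"-interpolate{n}"]
    args += ["-lm", lm]
    return args


# Placeholder file names and discount values, to be replaced by real ones.
cmd = abs_discount_command("TEXT", "LM", ["D1", "D2", "D3"])
print(shlex.join(cmd))
```

Dropping a -cdiscount<N> entry for some order would leave that order on the default Good-Turing discounting, as described above.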
>
> 5. ngram-count vs. ngram:
> if i use ngram-count with some combination of -prune and -minprune
> to obtain a model and then use ngram -ppl, will the result be identical
> to running ngram-count without the pruning flags, and running ngram -ppl
> on the new model with -prune -minprune as was previously done for model
> building?
Correct (again, barring any bugs...).
> 6. for ngram -ppl:
> in -debug 1, i believe, two measures are given per sentence, ppl and ppl1.
> how are they defined?
> is one C&G's $PP_p(T)$ (p.9,top)? then, what is the other?
I get a lot of questions about this because it's not documented,
except in the code. ppl1 is the perplexity computed without counting
the end-of-sentence tokens in the denominator (the end-of-sentence
log probabilities are still included in the total log probability).
ppl1 can be more meaningful for comparing perplexities on test sets that
have been segmented in different ways.
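The difference between the two measures can be sketched as follows. This assumes the total log probability is base 10 and that out-of-vocabulary words are left out of the word count; neither detail is stated above, so treat both as assumptions. The function and parameter names are made up for the example.

```python
def perplexities(logprob10, num_words, num_sentences, num_oovs=0):
    """Return (ppl, ppl1) from a total base-10 log probability.

    ppl counts one end-of-sentence token per sentence in the denominator;
    ppl1 leaves those tokens out of the denominator only. The total log
    probability is the same in both cases.
    """
    scored_words = num_words - num_oovs
    ppl = 10 ** (-logprob10 / (scored_words + num_sentences))
    ppl1 = 10 ** (-logprob10 / scored_words)
    return ppl, ppl1


# Toy numbers: 4 words in 2 sentences, total log10 probability of -12.
ppl, ppl1 = perplexities(-12.0, num_words=4, num_sentences=2)
print(ppl, ppl1)
```

Because the numerator is shared and only the denominator shrinks, ppl1 is never smaller than ppl for the same test set.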
--Andreas