some tech queries

Wed May 1 09:53:35 PDT 2002

In message <Pine.LNX.4.20_heb2.08.0204080255320.20347-100000 at pogo.cs.huji.ac.il> you wrote:
> 
> Hi,
> 
> I am new to SRILM, and quite new to language modelling at large
> (coming from other domains of n-gram models usage).
> 
> I have run some preliminary probes with SRILM (on linux, smooth install)
> and have the following questions:

Sorry for taking a while, but I hope the answers are still useful.

> 
> 1. in ngram-count:
>    when using -lm with the default -order 3, i had expected -text <textfile>
>    to yield the same model as -read <order1> -read <order2> -read <order3>
>    where order{1-3} have been obtained through ngram-count -write{1-3}
>    (all other parameters being equal). and yet the two LM files differ.
>    how come?

You are right, the two methods of getting the counts should be equivalent.
You can test this by doing 

	ngram-count -text TEST -write COUNTS

and

	ngram-count -read COUNTS -write NEWCOUNTS

and comparing COUNTS with NEWCOUNTS.  If you find a discrepancy then there
might be a bug, and I'd like you to send me a small test case that shows the
problem.
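
If the two count files list their N-grams in a different order, a plain diff
will report spurious differences.  The following is a minimal Python sketch of
an order-insensitive comparison.  It assumes each line holds the N-gram words
followed by an integer count, separated by whitespace; the function names
load_counts and diff_counts are made up for this illustration and are not part
of SRILM.

```python
from collections import Counter


def load_counts(lines):
    """Parse lines of the form 'w1 w2 ... wN count' into a Counter.

    Assumption: the count is the last whitespace-separated field.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *words, count = fields
        counts[tuple(words)] += int(count)
    return counts


def diff_counts(a, b):
    """Return {ngram: (count_in_a, count_in_b)} for every N-gram that differs."""
    return {
        ngram: (a.get(ngram, 0), b.get(ngram, 0))
        for ngram in set(a) | set(b)
        if a.get(ngram, 0) != b.get(ngram, 0)
    }


# In-memory stand-ins for two count files; real use would pass open file objects.
first = load_counts(["a b\t2\n", "a\t3\n"])
second = load_counts(["a\t3\n", "a b\t1\n"])
print(diff_counts(first, second))
```

An empty result means the two files contain the same counts.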

BTW, there is no reason to use -write1 -write2 -write3 together if you 
are going to combine the counts later. Just -write will do the job.

> 
> 2. in ngram-count:
>    i'm not quite clear about the multiple -cdiscount flags.
>    suppose i want a default -order 3 LM.
>    mustn't i give all three D's and have the model interpolate over all
>    of these, as eq. (18) in Chen&Goodman (p.15) implies?
>    in practice it seems one can specify any subset of the 3 and get
>    different models. (are there default Ds?)

The way it is implemented you have complete freedom to use 
different discounting methods for different orders of N-grams.
The default is Good-Turing, so

	-cdiscount1 D1 -cdiscount3 D3

would use absolute discounting for orders 1 and 3, but GT for bigrams.
(There is no default D value for absolute discounting).

Also, whether or not higher-order estimates use interpolation with 
lower-order estimates can be chosen separately for each order.

Not all possible combinations make sense from a theoretical point of 
view, so it's up to you to not abuse this flexibility.

> 3. in ngram-count:
>    probably closely related to question 2.
>    (and prob. due to some confusion i have between backoff & interpolation)
>    why are there multiple -interpolate flags.
>    again, eq. (18) in C&G appears to imply a recursive all levels
>    interpolation. and yet ngram-count appears to take any subset of
>    -interpolate{1-3} (in the above example) and yield different LMs.

See above.  -interpolate<N> estimates order-N N-gram probabilities by 
interpolating with order-(N-1) estimates.  The latter could themselves
be interpolated or not, so you control how far the recursion goes.
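
To make one step of that recursion concrete, here is a rough Python sketch of
interpolated absolute discounting for bigrams, in the spirit of the Chen &
Goodman formulation.  It is an illustration only, not SRILM's implementation:
the function name, the toy counts, and the fixed lower-order distribution are
all assumptions, and D is assumed to lie between 0 and 1.

```python
from collections import Counter


def interpolated_abs_discount(bigram_counts, lower_prob, D):
    """Return a function p(w, h) for a bigram model with absolute discount D.

    bigram_counts maps (h, w) to a count; lower_prob maps w to the
    lower-order probability that the bigram estimate is interpolated with.
    """
    hist_total = Counter()  # c(h): total count of bigrams starting with h
    hist_types = Counter()  # number of distinct words seen after h
    for (h, w), c in bigram_counts.items():
        hist_total[h] += c
        hist_types[h] += 1

    def prob(w, h):
        if hist_total[h] == 0:
            # Unseen history: nothing to discount, use the lower order alone.
            return lower_prob[w]
        discounted = max(bigram_counts.get((h, w), 0) - D, 0.0) / hist_total[h]
        # Mass removed by discounting, redistributed via the lower order.
        freed_mass = D * hist_types[h] / hist_total[h]
        return discounted + freed_mass * lower_prob[w]

    return prob


counts = {("a", "b"): 3, ("a", "c"): 1}
unigram = {"a": 0.5, "b": 0.25, "c": 0.25}
p = interpolated_abs_discount(counts, unigram, D=0.5)
print(sum(p(w, "a") for w in unigram))  # the distribution over w should sum to 1
```

Whether lower_prob is itself an interpolated estimate is exactly the choice
that the per-order flags leave open.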

> 4. combining 2+3:
>    if i want an absolute discount model of order, say 3, 
>    "by the book" C&G eq. (18), what is the proper way to run it? 
>    assume i have run ngram-count => get-gt-counts => make-abs-discount
>    and obtained <D1> <D2> <D3>.
>    a command line example will be highly appreciated.

Correct.
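
The reply does not spell out a command line, so the following is only a
sketch of what one might look like, assembled from the flags named in this
thread (-order, -text, -lm, -cdiscount<N>, -interpolate<N>).  The file names
and discount values are placeholders, the helper abs_discount_command is
hypothetical, and whether to interpolate at order 1 is a judgment call.

```python
import shlex


def abs_discount_command(text, lm, discounts):
    """Build an ngram-count argument list that uses absolute discounting
    and interpolation at every order from 1 to len(discounts)."""
    args = ["ngram-count", "-order", str(len(discounts)), "-text", text]
    for n, d in enumerate(discounts, start=1):
        args += [f"-cdiscount{n}", str(d), f"-interpolate{n}"]
    args += ["-lm", lm]
    return args


# Placeholder inputs: substitute your own training text, output file, and
# the D1, D2, D3 values obtained from make-abs-discount.
print(shlex.join(abs_discount_command("train.txt", "abs.lm", [0.5, 0.7, 0.8])))
```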

> 
> 5. ngram-count vs. ngram:
>    if i use ngram-count with some combination of -prune and -minprune 
>    to obtain a model and then use ngram -ppl, will the result be identical
>    to running ngram-count without the pruning flags, and running ngram -ppl
>    on the new model with -prune -minprune as was previously done for model
>    building?

Correct (again, barring any bugs...).

> 6. for ngram -ppl:
>    in -debug 1, i believe, two measures are given per sentence, ppl and ppl1.
>    how are they defined? 
>    is one C&G's $PP_p(T)$ (p.9,top)? then, what is the other?

I get a lot of questions about this because it's not documented,
except in the code.  ppl is the usual perplexity, with the end-of-sentence
tokens counted in the denominator.  ppl1 is the perplexity computed without
counting the end-of-sentence tokens in the denominator (the end-of-sentence
log probabilities are still included in the total log probability).
ppl1 can be more meaningful for comparing perplexities on test sets that
have been segmented in different ways.
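
As a small illustration of the difference, the following Python sketch
computes both figures from a total log probability.  The base-10 logarithm
and the exclusion of out-of-vocabulary words from the denominator are
assumptions made for this example; the function name is made up.

```python
def perplexities(log10_prob, num_words, num_sentences, num_oovs=0):
    """Return (ppl, ppl1) for a test set.

    log10_prob is the total base-10 log probability, including the
    end-of-sentence tokens.  ppl normalizes by words plus one
    end-of-sentence token per sentence; ppl1 normalizes by words only.
    """
    scored_words = num_words - num_oovs
    ppl = 10 ** (-log10_prob / (scored_words + num_sentences))
    ppl1 = 10 ** (-log10_prob / scored_words)
    return ppl, ppl1


# One sentence of two words with total log10 probability -6.
ppl, ppl1 = perplexities(-6.0, num_words=2, num_sentences=1)
print(ppl, ppl1)
```

Because the numerator is the same in both cases, ppl1 is never smaller than
ppl, and the gap grows as sentences get shorter.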

--Andreas