class LM

Tue Oct 8 08:52:48 PDT 2002

In message <3DA2D0DD.AE6387DB at uni-mb.si> you wrote:
> Andreas!
> 
> Thank you for your answers.
> 
> A few more questions:
> 
> 1.)
> I understand the transitions like:
> 
> [2gram]POSITION = 2 FROM: <504,NULL> TO: <756 504,NULL> WORD = primeri
> PROB = -1.76748 EXPANDPROB = 0.0106105
> 
> (504, 756 are classes),
> 
> but not the transitions like:
> 
> [OOV]POSITION = 2 FROM: <504,NULL> TO: <,NULL> WORD = primeri PROB =
> -inf
> 
> What does [OOV] mean? These transitions are not present in the test
> example of the toolkit.

[OOV] means a word was not found even in the unigrams of your model.
The ClassNgram code handles LMs that contain both word and class N-grams.
It therefore always also tries to find an N-gram probability for each
word directly (without class lookup), and if you don't include all class
member words in your vocabulary when building the LM, you will get this
"OOV" condition. But it is harmless, since presumably all your words get
some probability by virtue of being members of some class.
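
To see why the -inf on the [OOV] transition does no damage, here is a
minimal sketch in Python. It is not the toolkit's code. It assumes that
PROB is a log10 class N-gram probability, that EXPANDPROB is a linear
class-membership probability, and that the word's total probability is
the sum over all transitions into it. The helper log10_add is made up
for this illustration.

```python
import math

def log10_add(a, b):
    """Return log10(10**a + 10**b), treating -inf as probability zero."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))

# Scores for the word "primeri", taken from the debug output quoted above.
# Assumption: the class path scores as PROB + log10(EXPANDPROB).
oov_path = -math.inf                            # the [OOV] transition
class_path = -1.76748 + math.log10(0.0106105)   # the [2gram] transition

# The [OOV] path adds zero probability mass, so the total is unchanged.
total = log10_add(oov_path, class_path)
print(total == class_path)
```

The word still receives all of its probability through the class path.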

> 2.) In which case is the history string cleaned (FROM: <504,NULL> TO:
> <,NULL>) ?

When there are no histories in the LM that start with the given class
(504).  The history is kept only as long as it needs to be to compute
subsequent N-gram probabilities (so as to minimize the state space).
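
The idea can be sketched as follows. This is an illustration only, not
the actual ClassNgram implementation: the function truncate_history, the
class labels C0..C3 and the set of contexts are all hypothetical, and the
history is assumed to be ordered most recent first.

```python
def truncate_history(history, lm_contexts):
    """Keep the longest leading part of `history` that some LM context
    starts with. `history` is ordered most recent first."""
    for length in range(len(history), 0, -1):
        prefix = tuple(history[:length])
        if any(ctx[:length] == prefix for ctx in lm_contexts):
            return prefix
    return ()  # no context starts with this class: the empty "<,NULL>" state

# Hypothetical N-gram contexts stored in the LM (most recent class first).
contexts = {("C2", "C1"), ("C2",)}

kept = truncate_history(("C2", "C1", "C0"), contexts)   # C0 is never needed
cleared = truncate_history(("C3", "C2"), contexts)      # nothing starts with C3
print(kept, cleared)
```

Dropping the parts of the history that can never match a context lets
different paths share one state, which is what keeps the state space small.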

> 
> 3.) Is the vocabulary size in SRI-LM limited?

To the range of unsigned integers (2^32).

--Andreas