Open-vocabulary LM
Andreas Stolcke
stolcke at speech.sri.com
Tue Feb 25 09:02:59 PST 2003
Amelie,
That can happen if there are no unknown words in your training data, or if
you didn't specify a vocabulary file (in that case all observed words are
added to the vocabulary implicitly, so nothing maps to <unk>). It is also
possible that you set ngram cutoffs such that all ngrams involving <unk>
fall below the cutoffs and are therefore excluded from the LM.
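For illustration, a command along these lines (the file names here are
just placeholders) discards bigrams seen fewer than 2 times and trigrams
seen fewer than 3 times, including any that contain <unk>:

    ngram-count -order 3 -unk -vocab wordlist -text train.txt \
        -gt2min 2 -gt3min 3 -lm open-vocab.lm

If <unk> is rare in the training data, its higher-order ngrams can
easily fall below such cutoffs.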
To understand what's going on, run ngram-count with
    -write COUNTFILE
(in addition to the other options you use) and check which ngrams
containing <unk> are generated.
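For example (again with placeholder file names):

    ngram-count -order 3 -unk -vocab wordlist -text train.txt \
        -write counts.txt
    grep '<unk>' counts.txt

If <unk> does show up in the bigram and trigram counts but not in the
final LM, the cutoffs are the likely culprit.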
--Andreas
In message <3E5B960C.6010704 at ira.uka.de> you wrote:
> Hi,
> Is it normal that in an open-vocabulary LM (built with the "-unk"
> option) the <unk> token is present as a unigram, but not in any
> bigrams or trigrams?
> (Sorry if this is a silly question, but I am not very familiar with
> language models, and I was told that this would not be the case with
> other toolkits.)
> Thanks again,
>
> Amélie
>
> --
> --------------------------------------------------------------------
> Amélie DELTOUR
> ENSIMAG / Universität Karlsruhe
> E-mail : amelie.deltour at ira.uka.de
> --------------------------------------------------------------------
>
>