[SRILM User List] Restricting language model on n-gram types with at least n occurrences
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Aug 23 11:16:04 PDT 2018
The problem is that the -gt2min option is overridden when the GT
parameters are read from a file (the -gt1, -gt2 options).
Edit the "mincount" value in the relevant file by hand, then run
ngram-count for LM estimation.
Andreas
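
For illustration, the hand edit could also be scripted. This is only a
sketch: it assumes the GT parameter file contains a line of the form
"mincount N", and it works on a made-up stand-in file rather than a real
text-group.gt2 (the sample contents below are hypothetical).

```shell
# Hypothetical stand-in for text-group.gt2; the real file is written by
# ngram-count via -gt2. Assumed layout: a "mincount N" line, a
# "maxcount N" line, then "discount" lines. The values are invented.
gt_file=$(mktemp)
printf 'mincount 1\nmaxcount 7\ndiscount 1 0.5\n' > "$gt_file"

# Rewrite only the mincount line, leaving the other parameters untouched.
# Writing to a second file keeps the original estimate for comparison.
gt_edited=$(mktemp)
sed 's/^mincount .*/mincount 2/' "$gt_file" > "$gt_edited"

cat "$gt_edited"
```

The edited file would then be passed with -gt2 in the LM estimation step,
in place of the unedited one.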
On 8/23/2018 5:32 AM, claude.vividsky at gmail.com wrote:
> Hello,
>
> we would like to build a language model in which only n-gram types with
> at least 2 occurrences are considered. N-grams that occur only once are
> to be discarded.
>
> We use the following workflow (example for bigrams):
>
> Count N-Grams for each text
> ngram-count -order 2 -text text-group-001.txt -write text-group-001.count
> ngram-count -order 2 -text text-group-002.txt -write text-group-002.count
>
> Merge N-Grams into a common file
> ngram-merge -write text-group.count -- text-group-001.count text-group-002.count
>
> Extract vocabulary and estimate GT parameters
> ngram-count -sort -read text-group.count \
>     -write-vocab text-group.vocab \
>     -gt1 text-group.gt1 \
>     -gt2 text-group.gt2
>
> Build language model with -gt2min 1
> ngram-count -sort -order 2 -read text-group-001.count \
>     -vocab text-group.vocab \
>     -lm text-group-001-1.lm \
>     -gt2min 1 \
>     -gt1 text-group.gt1 -gt2 text-group.gt2
>
> Build language model with -gt2min 2
> ngram-count -sort -order 2 -read text-group-001.count \
>     -vocab text-group.vocab \
>     -lm text-group-001-2.lm \
>     -gt2min 2 \
>     -gt1 text-group.gt1 -gt2 text-group.gt2
>
> Apply the language model
> ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
> ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0
>
> The problem is that the resulting language models text-group-001-1.lm and
> text-group-001-2.lm are the same. Hence applying them to new texts yields
> the same values.
>
> Both texts contain n-grams that occur once and others that occur
> twice or more.
>
> What are we doing wrong?
>
> We appreciate your help!
>
> Claude