[SRILM User List] Restricting language model on n-gram types with at least n occurrences
claude.vividsky at gmail.com
Thu Aug 23 05:35:56 PDT 2018
Hello,
We would like to build a language model that only takes into account
n-gram types with at least 2 occurrences. N-grams which occur only once
are to be discarded.
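To make the goal concrete, here is a minimal sketch of the filtering we have in mind, written with awk rather than SRILM. It assumes the count file layout produced by ngram-count -write (the n-gram, a tab, then the count); the helper name filter_min_count is hypothetical and not part of SRILM.

```shell
# filter_min_count ORDER MIN
# Reads a count file on stdin (n-gram, tab, count per line).
# Drops n-grams of length ORDER whose count is below MIN;
# n-grams of any other length are passed through unchanged.
filter_min_count() {
    awk -F'\t' -v order="$1" -v min="$2" '
        { n = split($1, words, " ") }   # n = number of words in the n-gram
        n != order || $2 + 0 >= min     # print the line if it survives the cutoff
    '
}

# Toy example: the bigram "a b" occurs once and should disappear.
printf 'a\t3\na b\t1\nb c\t2\n' | filter_min_count 2 2
```

This only illustrates the intended effect on the counts. It is not a substitute for the discounting that ngram-count performs when estimating the model.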
We use the following workflow (example for bigrams):

Count n-grams for each text:
    ngram-count -order 2 -text text-group-001.txt \
        -write text-group-001.count
    ngram-count -order 2 -text text-group-002.txt \
        -write text-group-002.count

Merge n-grams into a common file:
    ngram-merge -write text-group.count -- \
        text-group-001.count text-group-002.count

Extract vocabulary and estimate GT parameters:
    ngram-count -sort -read text-group.count \
        -write-vocab text-group.vocab \
        -gt1 text-group.gt1 \
        -gt2 text-group.gt2

Build language model with -gt2min 1:
    ngram-count -sort -order 2 -read text-group-001.count \
        -vocab text-group.vocab \
        -lm text-group-001-1.lm \
        -gt2min 1 \
        -gt1 text-group.gt1 -gt2 text-group.gt2

Build language model with -gt2min 2:
    ngram-count -sort -order 2 -read text-group-001.count \
        -vocab text-group.vocab \
        -lm text-group-001-2.lm \
        -gt2min 2 \
        -gt1 text-group.gt1 -gt2 text-group.gt2

Apply the language models:
    ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt \
        -debug 0
    ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt \
        -debug 0

The problem is that the resulting language models text-group-001-1.lm and
text-group-001-2.lm are the same. Hence, applying them to new texts yields
the same perplexity values.
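One way to check this, sketched under the assumption that both models are written in the standard ARPA text format (whose \data\ header contains lines such as "ngram 2=<count>"), is to compare the number of bigrams each file declares. The helper name ngram_total is hypothetical. If the cutoff had taken effect, we would expect the second model to declare fewer bigrams than the first.

```shell
# ngram_total LMFILE ORDER
# Prints the number of n-grams of the given order as declared in the
# \data\ header of an ARPA-format language model file.
ngram_total() {
    sed -n "s/^ngram $2=//p" "$1" | head -n 1
}

# Intended use on the two models from the workflow:
#   ngram_total text-group-001-1.lm 2
#   ngram_total text-group-001-2.lm 2
# A byte-for-byte comparison is also possible:
#   cmp text-group-001-1.lm text-group-001-2.lm
```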
Both texts contain n-grams which occur once and others which occur
twice or more.
What are we doing wrong?
We appreciate your help!
Claude