[SRILM User List] Count number of n-grams common to a text and a language model
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Jun 13 14:28:06 PDT 2017
Use the output of ngram -debug 2 -ppl ...
For each word, it will output a line containing tokens of the form
[2gram]
[3gram]
etc.
These indicate whether the current input word and its context were
matched by a bigram, trigram, etc. in the model.
So you can tally these up and compute the percentage of test set ngrams
that were matched by model ngrams of each order.
(A match by an N-gram implies a match by all lower-order grams: N-1,
N-2, etc.)
Now, these would give you the TOKEN frequencies of ngram matches. (If an
ngram occurs multiple times in the test set, it counts multiple times.)
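
A minimal sketch of that tally in Python, in case it helps. It assumes
the per-word lines look roughly like
"p( cat | the ...) = [2gram] 0.0123 [ -1.91 ]" and only looks for the
[Ngram] token, so lines without one (sentence text, summary lines, OOV
words) are skipped. The function names are made up for illustration and
are not part of SRILM.

```python
import re
from collections import Counter

# Matches tokens such as [1gram], [2gram], [3gram] in the debug output.
NGRAM_TAG = re.compile(r"\[(\d+)gram\]")


def tally_ngram_hits(lines):
    """Count how many test-set words were matched at each ngram order."""
    counts = Counter()
    for line in lines:
        match = NGRAM_TAG.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def cumulative_share(counts):
    """Share of tagged words matched by an ngram of at least each order.

    A hit at order N also counts as a hit at every lower order.
    """
    total = sum(counts.values())
    shares = {}
    for order in sorted(counts):
        at_least = sum(c for o, c in counts.items() if o >= order)
        shares[order] = at_least / total if total else 0.0
    return shares
```

To use it, save the ngram output to a file, pass the open file to
tally_ngram_hits(), and print the result of cumulative_share(). Note
that the denominator here is the number of words that carry an [Ngram]
token, so OOV words are left out of the percentages.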
If instead you want to compute TYPE frequencies of ngram matches you'd
have to extract the ngrams from the model file, then from the test set
(using ngram-count), and compute the intersection.
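
The type-level intersection could be sketched along these lines. It
assumes the model is in ARPA text format (sections headed "\1-grams:",
"\2-grams:", ..., each line being log probability, the ngram, and an
optional backoff weight) and that the test-set counts were written by
ngram-count with -write, one "ngram<TAB>count" per line. The function
names are hypothetical helpers, not SRILM tools.

```python
import re

# Header of an ARPA section, e.g. "\2-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:\s*$")


def arpa_ngrams(lines):
    """Collect the ngrams listed in an ARPA-format model as word tuples."""
    ngrams = set()
    order = 0  # 0 until the first "\N-grams:" header is seen
    for line in lines:
        line = line.strip()
        header = SECTION.match(line)
        if header:
            order = int(header.group(1))
            continue
        if line == "\\end\\":
            break
        if order == 0 or not line:
            continue
        fields = line.split()
        # fields[0] is the log probability; the next `order` fields are words.
        if len(fields) >= order + 1:
            ngrams.add(tuple(fields[1:1 + order]))
    return ngrams


def count_file_ngrams(lines):
    """Collect the ngrams from a counts file; the last field is the count."""
    ngrams = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            ngrams.add(tuple(fields[:-1]))
    return ngrams


def shared_types_by_order(model, test):
    """Map each order to (test-set types, types also present in the model)."""
    result = {}
    for ngram in test:
        total, hits = result.get(len(ngram), (0, 0))
        result[len(ngram)] = (total + 1, hits + (ngram in model))
    return result
```

Each ngram type is counted once here no matter how often it occurs in
the test set, which is the difference from the token tally above.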
Andreas
On 6/13/2017 9:17 AM, claude.vividsky at gmail.com wrote:
> Hi,
>
> is there a way to extract the number of n-grams which are
> both in the language model and the text under test?
>
> Do I need to set special parameters for ngram-count when
> the language model is generated, and for ngram when the
> language model is applied to the text under test?
>
> Is there a way to extract or calculate this number from
> the output of ngram?
>
> Thank you
> Claude