[SRILM]: -debug 2 info

Sat Jul 8 20:50:35 PDT 2006

ilya oparin wrote:
> Hi!
>
> When I calculate the perplexity of my POS-based class model (a word can 
> belong to many classes; I create the class-definition file myself on 
> the basis of POS-tagged data), with "-debug 2" I get output I cannot 
> fully understand. For testing purposes I measure ppl on the same data 
> I trained the class model on (i.e. there should not be any OOVs). However, 
> in the debug output, for every N-gram there is a string of the format
> P(w| w...) = [OOV][n-gram][n-gram]...[OOV][n-gram][n-gram]...
> As far as I understand, the [n-gram]s refer to different combinations of 
> assigning words to classes. But why do those [OOV]s appear (and 
> they appear at equal intervals between strings of [n-gram]s for each 
> word)?

The entries in brackets refer to ngram lookups for the various class 
memberships. The first bracket refers to the ngram lookup where no class 
membership is involved, i.e., the word itself is used in the last ngram 
position (remember that in SRILM, class-based LMs may contain both word 
and class ngrams in the same model). So [OOV] here just means that no 
ngram containing the word directly was found.
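
To illustrate the lookup order described above, here is a minimal Python 
sketch. It is a toy model, not SRILM's actual code: the function name 
lookup_trace, the dictionary layouts, and the bracket labels are all 
assumptions made for illustration, and backoff and class expansion of 
the history are left out.

```python
def lookup_trace(word, history, ngram_probs, class_members):
    """Toy trace of the lookups for P(word | history).

    ngram_probs:   {(history, token): prob}, where token is a word or a class
    class_members: {word: {class_name: P(word | class_name)}}
    Returns (tags, prob): one bracket tag per lookup, plus the summed prob.
    """
    tags = []
    total = 0.0

    # First lookup: the word itself in the last ngram position,
    # with no class membership involved.
    direct = ngram_probs.get((history, word))
    if direct is None:
        tags.append("[OOV]")  # no ngram contains the word directly
    else:
        tags.append("[word]")
        total += direct

    # Remaining lookups: one per class the word can belong to.
    for cls, p_word_given_cls in class_members.get(word, {}).items():
        p_cls = ngram_probs.get((history, cls))
        if p_cls is None:
            tags.append("[OOV]")
        else:
            tags.append("[" + cls + "]")
            total += p_cls * p_word_given_cls

    return tags, total
```

In this sketch, a word that occurs only through its classes always 
produces a leading [OOV] followed by one bracket per class membership, 
even though the word is in the training data.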

--Andreas

>
> I have only one guess: since the [OOV]s are missing only for the last 
> (</s>| ...) n-gram, those [OOV]s may correspond to a check of whether a 
> word is present in an implicit stop-word vocabulary or something similar...
>
> It would be great if anybody could comment on that.
>
>
> best regards,
> Ilya