[SRILM User List] Word Guesser
Andreas Stolcke
stolcke at speech.sri.com
Fri Nov 20 15:21:07 PST 2009
John Day wrote:
> Hi,
> I'm generating some models for guessing the next word in a sequence,
> using a text file to generate the language model. Currently I'm using
> "Alice in Wonderland" as the training text. For example, the words
> "off with" are generally followed by "his" or "her" in the text:
>
> >grep -i "off with" ..\alice.phrases.txt
> Hes murdering the time Off with his head How dreadfully savage
> exclaimed Alice
> screamed Off with her head Off Nonsense said Alice
> Off with their heads and the procession moved on
> and shouting Off with his head or Off with her head about once in a minute
> Off with his head she said
> and shouting Off with his head or Off with her head Those whom she
> sentenced were taken in
> to custody by the soldiers
> Behead that Dormouse Turn that Dormouse out of court Suppress him
> Pinch him Off with his whiskers For some minutes the whole court was
> in confusion
> Off with her head the Queen shouted at the top of her voice
> I trained a model using this text and generate sentences by iterating
> through the lm->vocab, calling wordProb() twice, once with no context
> and then with the prefix words "off with":
> p1 = wordProb(word, NULL);
> p2 = wordProb(word, prefix);
> I notice that the probability changes when I do this, but the
> difference in probabilities is the same regardless of the prefix (-sw
> = starting words):
>
I suspect you are invoking the wordProb function incorrectly, or not
constructing the context appropriately.
(For example, it doesn't work to pass a NULL value as the context
argument, but maybe you didn't mean that literally.)
I would first verify the probabilities without writing any code. You
can simply pass N-grams (with a count of 1 in the last position) to
ngram -debug 2 -counts FILE. This lets you check the conditional
probabilities for different words and contexts.
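For example, a counts file for this check might look like the following (the words are just candidates from the thread; each line is an N-gram followed by a count of 1):

```
off with his 1
off with her 1
off with quarrelling 1
```

Scoring it with something like "ngram -lm alice.phrases.txt.lm -counts test.counts -debug 2" (the file names here are just the ones from this thread) should print a different conditional probability for each last word, which you can then compare against what your own code computes.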
If that checks out you can test your own code and make sure it yields
the same probabilities.
Andreas
>
> guess -gen10 -sw"off" alice.phrases.txt.lm
>
> COUNT: 2815
> quarrelling,-2.0320
> staring,-1.9070
> outside,-1.8101
> writing,-1.8101
> together,-1.8101
> sneezing,-1.6640
> shouted,-1.5091
> with,-1.2514
> from,-1.2419
> being,-1.2081
>
> guess -gen10 -sw"off with " alice.phrases.txt.lm
>
> COUNT: 2815
> quarrelling,-2.0320
> staring,-1.9070
> outside,-1.8101
> writing,-1.8101
> together,-1.8101
> sneezing,-1.6640
> shouted,-1.5091
> with,-1.2514
> from,-1.2419
> being,-1.2081
>
> guess -gen10 -sw"off with her " alice.phrases.txt.lm
>
> COUNT: 2815
> quarrelling,-2.0320
> staring,-1.9070
> outside,-1.8101
> writing,-1.8101
> together,-1.8101
> sneezing,-1.6640
> shouted,-1.5091
> with,-1.2514
> from,-1.2419
> being,-1.2081
>
> Why doesn't this work? Is the language model too small? Is this the
> correct way to compute the conditioning effect of prefix words on a
> word probability? Is there a better way to do this?
>
>
>
> Thanks,
>
> John Day
>
> Palm Bay, Florida
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user