[SRILM User List] Word Guesser

John Day af4ex.radio at yahoo.com
Wed Nov 18 13:19:16 PST 2009


Hi,
I'm generating some models for guessing the next word in a sequence, using a text file to build the language model. Currently I'm using "Alice in Wonderland" as the training text. For example, the words "off with" are generally followed by "his" or "her" in the text:
 
>grep -i "off with" ..\alice.phrases.txt
Hes murdering the time Off with his head How dreadfully savage exclaimed Alice
screamed Off with her head Off Nonsense said Alice
Off with their heads and the procession moved on
and shouting Off with his head or Off with her head about once in a minute
Off with his head she said
and shouting Off with his head or Off with her head Those whom she sentenced were taken in
to custody by the soldiers
Behead that Dormouse Turn that Dormouse out of court Suppress him Pinch him Off with his whiskers For some minutes the whole court was in confusion
Off with her head the Queen shouted at the top of her voice
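(For reference, a typical SRILM command to build a trigram model from this text would be something along these lines; the exact options are not critical here:)

>ngram-count -order 3 -text alice.phrases.txt -lm alice.phrases.txt.lm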
 
I trained a model on this text and generate sentences by iterating through lm->vocab, calling wordProb() twice for each word: once with no context and once with the prefix words "off with":
            p1 = wordProb(word, NULL);
            p2 = wordProb(word, prefix);
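For reference, LM::wordProb() takes its context as a VocabIndex array in reverse order (most recent word first), terminated by Vocab_None, and an empty context is an array holding just Vocab_None. A minimal sketch of querying the model that way (the file name and trigram order come from this example; the variable names are only illustrative):

    // Minimal sketch: compare P(her) with P(her | off with) using SRILM's C++ API.
    // wordProb() expects the context in reverse order, terminated by Vocab_None;
    // word strings must match the case used in the training text.
    #include <stdio.h>
    #include "Vocab.h"
    #include "Ngram.h"
    #include "File.h"
    #include "Prob.h"

    int main()
    {
        Vocab vocab;
        Ngram lm(vocab, 3);                        // trigram model

        File lmFile("alice.phrases.txt.lm", "r");
        lm.read(lmFile);

        // Prefix "off with", reversed: most recent word first.
        VocabIndex prefix[3];
        prefix[0] = vocab.getIndex("with");
        prefix[1] = vocab.getIndex("off");
        prefix[2] = Vocab_None;

        VocabIndex emptyContext[1] = { Vocab_None };

        VocabIndex word = vocab.getIndex("her");
        LogP p1 = lm.wordProb(word, emptyContext); // unigram log10 prob
        LogP p2 = lm.wordProb(word, prefix);       // conditioned on "off with"

        printf("log10 P(her) = %g, log10 P(her | off with) = %g\n", p1, p2);
        return 0;
    }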
I notice that the probability changes between the two calls, but the difference in probabilities is the same regardless of the prefix (-sw = starting words):
 
guess -gen10 -sw"off" alice.phrases.txt.lm 
COUNT: 2815
quarrelling,-2.0320
staring,-1.9070
outside,-1.8101
writing,-1.8101
together,-1.8101
sneezing,-1.6640
shouted,-1.5091
with,-1.2514
from,-1.2419
being,-1.2081
 
guess -gen10 -sw"off with " alice.phrases.txt.lm 
COUNT: 2815
quarrelling,-2.0320
staring,-1.9070
outside,-1.8101
writing,-1.8101
together,-1.8101
sneezing,-1.6640
shouted,-1.5091
with,-1.2514
from,-1.2419
being,-1.2081
 
guess -gen10 -sw"off with her " alice.phrases.txt.lm 
COUNT: 2815
quarrelling,-2.0320
staring,-1.9070
outside,-1.8101
writing,-1.8101
together,-1.8101
sneezing,-1.6640
shouted,-1.5091
with,-1.2514
from,-1.2419
being,-1.2081
 
Why doesn't this work? Is the language model too small? Is this the correct way to compute the conditioning effect of prefix words on a word's probability? Is there a better way to do this?
 
Thanks,
John Day
Palm Bay, Florida


      