[SRILM User List] Predicting words

Andreas Stolcke stolcke at icsi.berkeley.edu
Tue Sep 4 17:10:13 PDT 2012


I suspect there were some problems with the construction of the map 
file.   For one thing, when you have a word that is also a valid numeric 
string (like the second line in your example) you cannot leave out the 
explicit mapping probability.
Also, it turns out that it is much more convenient to use the disambig 
-classes option instead of -map to supply the mapping information (this 
allows you to give the mapping one-word-at-a-time for the "unknown" token).
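
To illustrate the difference (just a sketch; see the disambig(1) and
classes-format(5) man pages for the exact formats): in -map format each
line lists an observed word followed by its candidate words with
optional probabilities, so for a word that is itself numeric the
probability has to be written out explicitly:

     1961    1961 1.0

In classes format each line gives the class label (an LM word), a
probability, and one expansion, so the mapping for the "unknown" token
can be given one word per line:

     1961    1 1961
     1961    1 UNKNOWN-WORD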

Anyway, here is a short example that demonstrates that my instructions 
worked in principle ;-).
It uses the trigram LM supplied with SRILM.

# construct the map file in classes format
ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz \
    -write-vocab - | \
gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' > test.mapfile
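
The resulting test.mapfile should contain two lines per vocabulary
word, along these lines:

     great   1 great
     great   1 UNKNOWN-WORD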

# fill in the blanks (uses both left and right word context).
# Note: -order 2 is the default, so specify -order 3.
disambig -order 3 -classes test.mapfile \
    -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -text -

INPUT:    what a great UNKNOWN-WORD
OUTPUT: <s> what a great time </s>
INPUT:   that is the stupidest UNKNOWN-WORD i've heard
OUTPUT: <s> that is the stupidest thing i've heard </s>

Seems to work ;-)

Andreas

On 9/2/2012 3:59 AM, Federico Sangati wrote:
> Hi,
>
> Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below).
>
> MAPFILE:
> shock	shock
> 1961	1961
> … [same for all words occurring in vocabulary]
> UNK_NEXT_WORD	maturing analyzing attended … [list of all words occurring in vocabulary]
>
> INPUTFILE:
> No , UNK_NEXT_WORD
> <s> No , UNK_NEXT_WORD
> But while , UNK_NEXT_WORD
> <s> But while , UNK_NEXT_WORD
>
> OUTPUTFILE:
> <s> No , talent </s>
> <s> No , talent </s>
> <s> But while , talent </s>
> <s> But while , talent </s>
>
>
> Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or, for that matter, just the most probable following word given the prefix).
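> (Concretely, I mean something like
>
>     ngram -lm LMFILE -gen-prefixes PREFIXFILE
>
> with one prefix per line in PREFIXFILE; LMFILE and PREFIXFILE are just
> placeholder names. It continues each prefix with randomly sampled
> words rather than the most probable ones.)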
> It would be nice to know if there is any solution for this.
>
> Best,
> Federico Sangati
> University of Edinburgh
>
>
>> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote:
>> Indeed you can use disambig, at least in theory, to solve this problem.
>>
>> 1. prepare a map file of the form:
>>
>>       a       a
>>       man    man
>>       ...   [for all words occurring in your data]
>>       UNKNOWN_WORD  word1 word2  ....  [list all words in the vocabulary
>> here]
>>
>> 2. train an LM of word sequences.
>>
>> 3. prepare disambig input of the form
>>
>>                   a man is sitting UNKNOWN_WORD
>>
>>      You can also add known words to the right of UNKNOWN_WORD if you have
>> that information (see the note about -fw-only below).
>>
>> 4. run disambig
>>
>>               disambig -map MAPFILE -lm LMFILE -text INPUTFILE
>>
>> If you want to use only the left context of the UNKNOWN_WORD use the
>> -fw-only option.
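>> For example (a sketch, with placeholder file names):
>>
>>               ngram-count -order 3 -text TRAIN.txt -lm LMFILE
>>               disambig -fw-only -map MAPFILE -lm LMFILE -text INPUTFILE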
>>
>> This is in theory.  If your vocabulary is large it may be very slow and
>> take too much memory.  I haven't tried it, so let me know if it works
>> for you.
>>
>> Andreas
>>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote:
>>>   Hello,
>>> I am new to language modeling and was hoping that someone could help me with the following.
>>> I am trying to predict a word given an input sentence. For example, I would like to find the word with the highest probability to replace the ... in sentences such as 'A man is ...' (e.g. sitting).
>>> I tried to use the disambig tool but I couldn't find any example illustrating how to use it, especially how to create the map file and what format this file should be in (e.g. txt, arpa, ...).


