[SRILM User List] Predicting words

Mon Sep 3 05:37:16 PDT 2012

FYI, for others on the list and the archives--

After talking to Federico offline, I think he ended up solving his
problem by using the Python bindings I wrote a while back to query the
ngram model directly. Since they might be useful to others I went
ahead and uploaded them to github as well:
  https://github.com/njsmith/pysrilm
  Download snapshot: https://github.com/njsmith/pysrilm/zipball/master
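
For the archives, here is a minimal sketch of the general idea of
querying a model directly: score every vocabulary word given the
prefix and keep the best one. It does not use the actual pysrilm API.
The `logprob(word, context)` callable, the toy bigram table and its
numbers are all hypothetical stand-ins, so swap in whatever scoring
function your bindings provide.

```python
import math

def predict_next_word(context, vocab, logprob):
    """Return (word, score) for the vocab word maximising logprob(word, context)."""
    best_word, best_score = None, -math.inf
    for word in vocab:
        score = logprob(word, context)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical toy bigram table standing in for a real language model.
# The probabilities are made up purely for illustration.
TOY_BIGRAMS = {
    ("is", "sitting"): math.log10(0.6),
    ("is", "standing"): math.log10(0.3),
    ("man", "is"): math.log10(0.9),
}
FLOOR = math.log10(1e-6)  # score assigned to unseen bigrams in this toy model

def toy_logprob(word, context):
    """Toy scorer: looks only at the last context word (bigram assumption)."""
    prev = context[-1] if context else "<s>"
    return TOY_BIGRAMS.get((prev, word), FLOOR)

vocab = ["a", "man", "is", "sitting", "standing"]
word, score = predict_next_word(["a", "man", "is"], vocab, toy_logprob)
print(word)
```

This is one model query per vocabulary word for each prediction, so
the cost grows linearly with the vocabulary size.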

-- Nathaniel Smith
University of Edinburgh

On Sun, Sep 2, 2012 at 11:59 AM, Federico Sangati
<federico.sangati at gmail.com> wrote:
> Hi,
>
> Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below).
>
> MAPFILE:
> shock   shock
> 1961    1961
> … [same for all words occurring in vocabulary]
> UNK_NEXT_WORD   maturing analyzing attended … [list of all words occurring in vocabulary]
>
> INPUTFILE:
> No , UNK_NEXT_WORD
> <s> No , UNK_NEXT_WORD
> But while , UNK_NEXT_WORD
> <s> But while , UNK_NEXT_WORD
> The 49 stock specialist UNK_NEXT_WORD
>
> OUTPUTFILE:
> <s> No , talent </s>
> <s> No , talent </s>
> <s> But while , talent </s>
> <s> But while , talent </s>
> <s> The 49 stock specialist talent </s>
>
> Btw, I'm wondering why there is no way to use 'ngram' for this: it has the nice '-gen-prefixes file' option, which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or, for that matter, just the most probable next word given the prefix).
> It would be nice to know if there is any solution for this.
>
> Best,
> Federico Sangati
> University of Edinburgh
>
>
>> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote:
>> Indeed, you can use disambig, at least in theory, to solve this problem.
>>
>> 1. prepare a map file of the form:
>>
>>     a       a
>>     man    man
>>     ...   [for all words occurring in your data]
>>     UNKNOWN_WORD  word1 word2  ....  [list all words in the vocabulary
>> here]
>>
>> 2. train an LM of word sequences.
>>
>> 3. prepare disambig input of the form
>>
>>                 a man is sitting UNKNOWN_WORD
>>
>>    You can also add known words to the right of UNKNOWN_WORD if you have
>> that information (see the note about -fw-only below).
>>
>> 4. run disambig
>>
>>             disambig -map MAPFILE -lm LMFILE -text INPUTFILE
>>
>> If you want to use only the left context of the UNKNOWN_WORD use the
>> -fw-only option.
>>
>> This is in theory.  If your vocabulary is large it may be very slow and
>> take too much memory.  I haven't tried it, so let me know if it works
>> for you.
>>
>> Andreas
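
The map file from step 1 is tedious to write by hand for a real
vocabulary. A small sketch that generates it is below. It assumes that
a tab between the first field and the rest is acceptable (the examples
in this thread only show whitespace), and the function names and the
sorted ordering are my own choices for illustration, not anything
disambig requires.

```python
def build_map_lines(vocab, placeholder="UNKNOWN_WORD"):
    """Build map-file lines: each word maps to itself, and the
    placeholder maps to every word in the vocabulary."""
    words = sorted(set(vocab))  # de-duplicate; sorting is for readability only
    lines = [f"{w}\t{w}" for w in words]
    lines.append(placeholder + "\t" + " ".join(words))
    return lines

def write_map_file(path, vocab, placeholder="UNKNOWN_WORD"):
    """Write the map lines to `path`, one entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(build_map_lines(vocab, placeholder)) + "\n")

for line in build_map_lines(["man", "a", "is"]):
    print(line)
```

The file written by `write_map_file` would then be the MAPFILE passed
to the disambig command in step 4.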
>
>>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote:
>>> Hello,
>>> I am new to language modeling and was hoping that someone could help me with the following.
>>> I am trying to predict a word given an input sentence. For example, I would like to find the word with the highest probability to replace the ... in sentences such as 'A man is ...' (e.g. sitting).
>>> I tried to use the disambig tool, but I couldn't find any example illustrating how to use it, especially how to create the map file and what type of file it is (e.g. txt, arpa, ...).