[SRILM User List] Predicting words

Federico Sangati federico.sangati at gmail.com
Sun Sep 2 03:59:16 PDT 2012


Hi,

Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). 

MAPFILE:
shock	shock
1961	1961
… [same for all words occurring in vocabulary]
UNK_NEXT_WORD	maturing analyzing attended … [list of all words occurring in vocabulary]

INPUTFILE:
No , UNK_NEXT_WORD
<s> No , UNK_NEXT_WORD
But while , UNK_NEXT_WORD
<s> But while , UNK_NEXT_WORD
The 49 stock specialist UNK_NEXT_WORD

OUTPUTFILE:
<s> No , talent </s>
<s> No , talent </s>
<s> But while , talent </s>
<s> But while , talent </s>
<s> The 49 stock specialist talent </s>

Btw, I'm wondering why there is no way to use 'ngram' for this: its '-gen-prefixes file' option is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or, for that matter, just the most probable following word given the prefix). 
It would be nice to know if there is any solution for this.
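
In the meantime, the only workaround I can think of (untested, so just a rough sketch) is to score every candidate continuation explicitly with 'ngram -ppl' and keep the best one. Something along these lines, where LMFILE is the trained LM, vocab.txt lists one word per line, and the prefix is hard-coded just for illustration:

    # one line per candidate: the prefix followed by the candidate word
    awk -v prefix="No ," '{ print prefix, $1 }' vocab.txt > candidates.txt

    # -debug 2 prints the conditional probability of every word in every
    # line, so the entry at the candidate's position is p(candidate | prefix)
    ngram -lm LMFILE -order 3 -ppl candidates.txt -debug 2 > scores.txt

and then pick the candidate whose probability at the last word position is highest. This is brute force and still treats the prefix as the start of a sentence, but at least it would give the most probable word rather than a sampled one.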

Best,
Federico Sangati
University of Edinburgh


> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote:
> Indeed you can use disambig, at least in theory, to solve this problem.
> 
> 1. prepare a map file of the form:
> 
>     a       a
>     man    man
>     ...   [for all words occurring in your data]
>     UNKNOWN_WORD  word1 word2  ....  [list all words in the vocabulary 
> here]
> 
> 2. train an LM of word sequences (see the example command below).
> 
> 3. prepare disambig input of the form
> 
>                 a man is sitting UNKNOWN_WORD
> 
>    You can also add known words to the right of UNKNOWN_WORD if you have 
> that information (see the note about -fw-only below).
> 
> 4. run disambig
> 
>             disambig -map MAPFILE -lm LMFILE -text INPUTFILE
> 
> If you want to use only the left context of the UNKNOWN_WORD use the 
> -fw-only option.
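> 
> For concreteness, a typical run of steps 2 and 4 might look something 
> like this (MAPFILE, LMFILE, INPUTFILE as above; TRAIN.txt stands for 
> your training text, one sentence per line; adjust the order and 
> smoothing options to your data):
> 
>             # step 2: train a trigram LM on the word sequences
>             ngram-count -order 3 -text TRAIN.txt -lm LMFILE
> 
>             # step 4: tag the UNKNOWN_WORD placeholder, using only the 
>             # left context
>             disambig -map MAPFILE -lm LMFILE -text INPUTFILE -fw-only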
> 
> This is in theory.  If your vocabulary is large it may be very slow and 
> take too much memory.  I haven't tried it, so let me know if it works 
> for you.
> 
> Andreas

>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote:
>> Hello,
>> I am new to language modeling and was hoping that someone could help me with the following. 
>> I am trying to predict a word given an input sentence. For example, I would like to get the word that has the highest probability as a replacement for the '...' in sentences such as 'A man is ...' (e.g. sitting).
>> I tried to use the disambig tool, but I couldn't find any example illustrating how to use it, especially how to create the map file and what type of file it should be (e.g. txt, arpa, ...).


