[SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing

Fri May 27 14:22:29 PDT 2011

Ariya Rastrow wrote:
>
> On Mon, May 23, 2011 at 1:45 AM, Andreas Stolcke 
> <stolcke at icsi.berkeley.edu> wrote:
>
>     Ariya Rastrow wrote:
>
>         Hi,
>         I have a question regarding building N-gram LMs with
>         Jelinek-Mercer smoothing. I have optimized the weights using
>         my own scripts on some held-out data, and now I am trying to
>         write out the LM in ARPA backoff format. I have the N-gram
>         probabilities and the corresponding weights for 1-grams,
>         2-grams, and 3-grams. I was wondering if I could use the SRILM
>         toolkit to get the ARPA representation of my LM. I have tried
>         ngram with the -count-lm option along with -write, but then it
>         only writes out the LM as the header file that is described
>         under the -count-lm option. I know this is an easy task and
>         one can use the weights as the backoff weights to get the
>         ARPA format. Any help would be appreciated.
>
>     If you know how to create the count-LM then you're halfway there.
>
>     To get a backoff LM, you can first train a backoff LM using one of
>     the standard LM smoothing methods (say GT, the default), then use
>     the previously created count-LM to "rescore" the probabilities
>     in the backoff LM (ngram -rescore-ngram option). Be aware that
>     this only approximates the interpolated LM, although the
>     approximation is exact for all N-grams contained in the training data.
>
>     Andreas
>
> The reason I wanted the ARPA format for the Jelinek-Mercer smoothed LM
> was to be able to load it in C++ code. I understand that the ARPA format
> would be an approximation, as you mentioned. Can you please let me know
> the best way to load the N-grams and their probabilities, along with
> the interpolation weights, in C++ code and perhaps do the interpolation
> on the fly? Basically, my question is how to use a Jelinek-Mercer LM in
> C++ code, given that I already have the weights and N-gram probabilities
> (I can make the header file as in -count-lm).
The whole point of SRILM is to be able to link with C++ through the API.
You just need to instantiate the Vocab class and the NgramCountLM class,
invoke the read() method, and then use the wordProb() function to obtain
conditional probabilities.
The man pages for Vocab(3) and LM(3) describe the interface.
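
For reference, the "interpolation on the fly" asked about above can be
sketched without SRILM at all. The block below is only an illustration,
not SRILM code: JelinekMercerLM, mlProb, lambda, vocabSize and
makeToyModel are hypothetical names, and it assumes a single
interpolation weight per N-gram order (a count-LM can use more
fine-grained weights) and a uniform distribution below the unigram
level. It computes
P_n(w|h) = lambda_n * P_ML(w|h) + (1 - lambda_n) * P_{n-1}(w|h'),
building up from the lowest order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container for a Jelinek-Mercer LM (not an SRILM class).
struct JelinekMercerLM {
    // Maximum-likelihood estimates, keyed by "context words + word"
    // joined with single spaces, e.g. "c a b" holds P_ML(b | c a).
    std::map<std::string, double> mlProb;
    // lambda[n-1] is the weight on the order-n ML estimate
    // (assumption: one weight per order).
    std::vector<double> lambda;
    // Vocabulary size for the uniform order-0 distribution.
    std::size_t vocabSize = 0;

    // Interpolated P(word | history); history is in natural order,
    // oldest word first. Returns a plain probability, not a log.
    double prob(const std::string& word,
                const std::vector<std::string>& history) const {
        assert(vocabSize > 0 && "vocabulary size must be positive");
        double p = 1.0 / static_cast<double>(vocabSize);  // order 0
        for (std::size_t n = 1; n <= lambda.size(); ++n) {
            const std::size_t ctx = n - 1;     // context length at order n
            if (ctx > history.size()) break;   // not enough history
            std::string key;
            for (std::size_t i = history.size() - ctx; i < history.size(); ++i)
                key += history[i] + " ";
            key += word;
            const auto it = mlProb.find(key);
            const double ml = (it == mlProb.end()) ? 0.0 : it->second;
            // Mix this order's ML estimate with the lower-order result.
            p = lambda[n - 1] * ml + (1.0 - lambda[n - 1]) * p;
        }
        return p;
    }
};

// Toy model with made-up numbers, for illustration only.
JelinekMercerLM makeToyModel() {
    JelinekMercerLM lm;
    lm.vocabSize = 4;
    lm.lambda = {0.8, 0.6, 0.5};   // unigram, bigram, trigram weights
    lm.mlProb["b"] = 0.25;         // P_ML(b)
    lm.mlProb["a b"] = 0.5;        // P_ML(b | a)
    lm.mlProb["c a b"] = 1.0;      // P_ML(b | c a)
    return lm;
}
```

With these toy numbers, makeToyModel().prob("b", {"c", "a"}) works out
to 0.5 * 1.0 + 0.5 * (0.6 * 0.5 + 0.4 * 0.25) = 0.7. Note that SRILM's
wordProb() returns log10 probabilities, so take std::log10 of the value
above before comparing the two.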

Andreas

>
> Thanks,
> Ariya