[SRILM User List] Train lm character level

Anand Venkataraman venkataraman.anand at gmail.com
Wed Jan 9 10:25:55 PST 2013


It depends on what you want to accomplish with the LM. Under most
circumstances you would want to preserve the word boundary information
(akin to the sentence boundary tags <s> and </s>, which stand for the
start and end of a sentence).

The first format you describe (training.txt) accomplishes this: since
ngram-count treats each line as a sentence, the <s> and </s> tags it adds
around each line serve as a proxy for your word boundaries. But it loses
information that you could otherwise have obtained from knowing which
words are likely to occupy which positions in a sentence. For example,
"the" is almost invariably followed by another word, and hence <w> should
be more likely after "the" than after, say, an arbitrary noun. You could
introduce <w> and </w> as special tokens marking the start and end of each
word in training2.txt, for instance.
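
As a rough sketch of the <w> and </w> idea, assuming you start from
ordinary word-level text with one sentence per line, something like the
following Python would produce the character-level training text. The
function names (to_char_tokens, convert) are made up for illustration and
are not part of SRILM.

```python
def to_char_tokens(sentence, open_tag="<w>", close_tag="</w>"):
    """Spell out each word as space-separated characters, wrapped in
    word-boundary tokens. The tag names are just a convention; to the LM
    they are ordinary vocabulary items."""
    tokens = []
    for word in sentence.split():
        tokens.append(open_tag)
        tokens.extend(word)  # one token per character
        tokens.append(close_tag)
    return " ".join(tokens)


def convert(lines):
    """Convert an iterable of sentences, skipping blank lines."""
    return [to_char_tokens(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical input: one sentence per line.
    for out in convert(["simply good", "thank you"]):
        print(out)
    # <w> s i m p l y </w> <w> g o o d </w>
    # <w> t h a n k </w> <w> y o u </w>
```

You would write the converted lines to a file and pass that file to the
same ngram-count command you quoted. There should be no need to add <s>
and </s> yourself, since ngram-count supplies them for each line.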

&

On Tue, Jan 8, 2013 at 8:19 PM, Koonnoo <dcherubangel at gmail.com> wrote:

> Dear All
> i used this tool via cygwin terminal.
>
> Example in training.txt
>
> s i m p l y
> g o o d
> t h a n k y o u
> c l o u n d
>
> or
> training2.txt
> s i m p l y g o o d t h a n k y o u c l o u n d
>
>
> which training text correct for LM built on character level? first,right?
> If first i can directly enter to add more line or add some symbol for add
> line?
>
> i'm not sure what is <s>and</s> mean.
>
> Is this command for train lm model character level (trigram)?
> $ ngram-count -text /srilm/training.txt -order 3 -lm /srilm/training.lm
>
> My english is weak maybe i ask you more than 1 time :)
> thank you in advance