[SRILM User List] Do you have character level for LM in SRILM toolkit

Anand Venkataraman venkataraman.anand at gmail.com
Sun Jan 6 09:53:17 PST 2013


If you want an LM built over character sequences, you simply have to break
your input stream into whitespace-separated letters. Note that there are
several design choices here, e.g. whether to add word-boundary tokens
(analogous to <s> and </s>) or simply put one word per line, whether the
fixed vocabulary (the alphabet plus any meta characters you want) is given up
front or learned at build time, how you handle special characters and
punctuation, etc.

Assuming English text, the following Unix command can get you started. The tr
command turns each space into a newline, so the stream becomes one word per
line, and sed inserts a space after every character on each line.

cat corpus.txt | tr ' ' '\012' | sed 's/\(.\)/\1 /g' | ngram-count ...
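
If you would rather keep one sentence per line and mark word boundaries
explicitly, something along these lines might work. This is only a sketch: the
to_chars function name, the <w> boundary token, the file names corpus.txt and
char.lm, and the order of 5 are all my own placeholders, and Witten-Bell
discounting is just one choice that tends to behave well with a tiny
vocabulary. The ngram-count step only runs if SRILM is on your PATH.

```shell
# Split every word into space-separated characters and put a <w> token
# between words. <w> is a hypothetical boundary marker, not an SRILM built-in.
# Assumes plain English text; multi-byte characters depend on your awk/locale.
to_chars() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      n = length($i)
      for (j = 1; j <= n; j++)
        out = out (out == "" ? "" : " ") substr($i, j, 1)
      if (i < NF) out = out " <w>"   # boundary between adjacent words
    }
    print out
  }'
}

# Train a character n-gram model if SRILM and the corpus are available.
# "-text -" makes ngram-count read the training text from stdin.
if command -v ngram-count >/dev/null 2>&1 && [ -f corpus.txt ]; then
  to_chars < corpus.txt | ngram-count -order 5 -wbdiscount -text - -lm char.lm
fi
```

With this, an input line "the cat" becomes "t h e <w> c a t", and ngram-count
adds <s> and </s> around each line as usual.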


HTH

&

On Sun, Jan 6, 2013 at 8:56 AM, Nutthamon <dcherubangel at gmail.com> wrote:

> Hello,
>
> I am new to language modeling and the SRILM toolkit.
>
> Can this toolkit generate a character-level language model? If so, what is the command for that? I can't find it. Please give me an example.
>
> Many thanks in advance