[SRILM User List] Fwd: Batch no-sos and no-eos
Alex Tomescu
alex.dan.tomescu at gmail.com
Sat Jul 28 03:09:18 PDT 2012
Hi
I need to make a language model from a set of 5000+ texts. The texts
are separated into one sentence per line so there are a lot of
sentence boundary tokens which I need to get rid of.
I used make-batch-counts and merge-batch-counts to count the n-grams,
and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
sentence boundaries were still included.
I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
the same results.
Removing '\n' from the text files resulted in "line 1: line too long".
I tried ngram-count with -no-eos -no-sos on one of the files and it
worked, but on a batch it didn't seem to work.
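
To make the intended result concrete, here is a minimal Python sketch of per-line n-gram counting with and without sentence boundary tokens. It is not SRILM code and makes no claim about how ngram-count is implemented; the function name `count_ngrams` and its parameters are hypothetical, and skipping empty lines is an assumption of the sketch.

```python
from collections import Counter


def count_ngrams(lines, order=2, add_sos=True, add_eos=True):
    """Count n-grams up to `order`, treating each line as one sentence.

    Hypothetical helper for illustration only; not part of SRILM.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # assumption: empty lines contribute nothing
        if add_sos:
            tokens = ["<s>"] + tokens
        if add_eos:
            tokens = tokens + ["</s>"]
        # Collect every n-gram of length 1..order within this sentence.
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


sentences = ["the cat sat", "the dog ran"]
with_bounds = count_ngrams(sentences)
without_bounds = count_ngrams(sentences, add_sos=False, add_eos=False)
```

In this sketch, `with_bounds` contains entries such as `("<s>", "the")`, while `without_bounds` contains only n-grams made of the words themselves, which is the outcome wanted from the batch counts.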
Any ideas on what I should try next?
Thanks
--
Alexandru Tomescu, undergraduate Computer Science student at
Polytechnic University of Bucharest