[SRILM User List] Fwd: Batch no-sos and no-eos

Sat Jul 28 03:09:18 PDT 2012

Hi

I need to build a language model from a set of 5000+ texts. The texts
are split into one sentence per line, so there are a lot of
sentence boundary tokens which I need to get rid of.

I used make-batch-counts and merge-batch-counts to count the n-grams,
and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
sentence boundaries were still included.

I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
the same results.

Removing '\n' from the text files resulted in "line 1: line too long".

I tried ngram-count with -no-eos -no-sos on one of the files and it
worked, but it didn't seem to work on a batch.

Any ideas on what I should try next?

Thanks
--
Alexandru Tomescu, undergraduate Computer Science student at
Polytechnic University of Bucharest