[SRILM User List] Fwd: Batch no-sos and no-eos

Andreas Stolcke stolcke at icsi.berkeley.edu
Sat Jul 28 09:46:05 PDT 2012


On 7/28/2012 3:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to make a language model from a set of 5000+ texts. The texts
> are separated into one sentence per line so there are a lot of
> sentence boundary tokens which I need to get rid of.
>
> I used make-batch-counts and merge-batch-counts to count the ngrams,
> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
> sentence boundaries were still included.
I don't see this behavior.  With make-big-lm -no-sos -no-eos it's true
that <s> and </s> appear in the unigram section of the LM (they are
still part of the vocabulary, like other words that are in your vocab
file but don't occur in your training data), but there are no
higher-order N-grams involving <s> or </s> in the resulting LM.

The same is true if you run ngram-count -no-sos -no-eos, so the two ways 
of building the LM are consistent in this regard.
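
One rough way to check this on your own model is to scan the ARPA-format
LM file for N-grams of order 2 or higher that contain <s> or </s>.  The
Python sketch below is not part of SRILM; the function name
boundary_ngrams and the toy LM text (with made-up probabilities) are
assumptions for illustration only.

```python
import re

BOUNDARY_TAGS = {"<s>", "</s>"}

# Hand-written toy LM in ARPA layout; the numbers are invented.
SAMPLE_LM = r"""
\data\
ngram 1=4
ngram 2=1

\1-grams:
-99 <s>
-0.60206 </s>
-0.60206 the -0.30103
-0.60206 cat

\2-grams:
-0.30103 the cat

\end\
"""


def boundary_ngrams(arpa_lines):
    """Return (order, words) for each N-gram of order >= 2 containing <s> or </s>."""
    hits = []
    order = 0  # 0 means we are not inside any "\N-grams:" section
    for raw in arpa_lines:
        line = raw.strip()
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            continue
        if not line or line.startswith("\\"):
            # Blank lines, \data\ and \end\ carry no N-grams.
            if line == "\\end\\":
                order = 0
            continue
        if order < 2:
            # Header lines and unigrams are skipped: the tags may
            # legitimately appear in the unigram section.
            continue
        fields = line.split()
        words = fields[1:1 + order]  # fields[0] is the log probability
        if BOUNDARY_TAGS & set(words):
            hits.append((order, tuple(words)))
    return hits
```

For a real model you would pass the lines of the LM file instead, e.g.
boundary_ngrams(open(path, encoding="utf-8")), where path is the file
you gave to -lm.  An empty result means the tags only show up as
unigrams.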

Presently, -no-sos and -no-eos only affect the way N-grams are generated
from text.  After counts are extracted, they don't affect any part of
the LM building process.  It might make sense for these options to also
modify the default vocab membership of <s> and </s>.  Having the tags in
the vocab without N-grams should be fine for most LM uses, but I can see
an argument for removing them.  Is that the behavior you are looking for?
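
To illustrate what "affect the way N-grams are generated from text"
means, here is a simplified Python sketch of counting with and without
boundary tags.  It is a toy model, not SRILM's implementation: the
function count_ngrams and its add_sos/add_eos parameters are
hypothetical names, and it assumes each input line is one sentence and
that N-grams never span lines.

```python
from collections import Counter


def count_ngrams(lines, order=2, add_sos=True, add_eos=True):
    """Count all N-grams up to `order`, optionally adding <s> and </s> per line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip empty lines
        if add_sos:
            words = ["<s>"] + words
        if add_eos:
            words = words + ["</s>"]
        # Collect every N-gram of length 1..order within this line only.
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts
```

With add_sos=False and add_eos=False no counted N-gram contains a tag,
which mirrors the effect described above: the options change only the
counts, and anything built from those counts inherits that.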

Andreas


>
> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
> the same results.
>
> Removing '\n' from the text files resulted in "line 1: line too long".
>
> I tried ngram-count with -no-eos -no-sos on one of the files and it
> worked, but on a batch it didn't seem to work.
>
> Any ideas on what I should try next?
>
> Thanks
> --
> Alexandru Tomescu, undergraduate Computer Science student at
> Polytechnic University of Bucharest
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user


