[SRILM User List] Fwd: Batch no-sos and no-eos

Tony Robinson tonyr at cantabresearch.com
Sat Jul 28 04:16:20 PDT 2012


Hi Alex,

<s> and </s> are not really "sentence boundary" tokens, even though 
that's what everyone calls them and how they are used most of the 
time. They mark the start and end of an utterance context. So for 
your problem, pick a suitably large chunk - say, decode a chapter at 
a time - put an <s> at the start and a </s> at the end, and replace 
the rest of the sentence boundaries with <PERIOD>.
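
For example, an untested sketch along these lines (the <PERIOD> token, 
the one-file-per-chapter layout, and the script name are just 
placeholders for whatever your data actually uses):

    # join_chapter.py - untested sketch.
    # Joins a sentence-per-line chapter file into a single line,
    # replacing the internal sentence boundaries with <PERIOD>.
    import sys

    def chapter_to_line(path):
        with open(path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        # One chapter per output line; <PERIOD> marks the old boundaries.
        return " <PERIOD> ".join(sentences)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print(chapter_to_line(path))

ngram-count treats each input line as one sentence, so it will add 
<s> and </s> once per chapter line by itself.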

I'm back, so mail me if this doesn't make sense.


Tony

On 07/28/2012 11:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to build a language model from a set of 5000+ texts. The texts
> are formatted with one sentence per line, so there are a lot of
> sentence boundary tokens that I need to get rid of.
>
> I used make-batch-counts and merge-batch-counts to count the ngrams,
> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
> the sentence boundary tokens were still included.
>
> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
> the same results.
>
> Removing '\n' from the text files resulted in "line 1: line too long".
>
> I tried ngram-count with -no-eos -no-sos on one of the files and it
> worked, but on a batch it didn't seem to work.
>
> Any ideas on what I should try next?
>
> Thanks
> --
> Alexandru Tomescu, undergraduate Computer Science student at
> Polytechnic University of Bucharest
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user


-- 
Dr A J Robinson, Founder and Director of Cantab Research Limited.
St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
Company reg no 05697423 (England and Wales), VAT reg no 925606030.

