[SRILM User List] Fwd: Batch no-sos and no-eos

Alex Tomescu alex.dan.tomescu at gmail.com
Sun Jul 29 03:46:21 PDT 2012


Hello,

> I don't see this behavior.  With make-big-lm -no-sos -no-eos  it's true that <s> and </s> appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are no higher-order N-grams involving <s> or </s> in the resulting LM.


These are the exact parameters I passed to make-big-lm, and when I
looked through the LM there were still n-grams containing </s>, for example
("-0.0009011862   <PERIOD> </s>")

make-big-lm -name biglm -read merge-iter9-1.ngrams.gz -lm gut.lm
-no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab
using existing gtcounts
warning: discount coeff 1 is out of range: 1.1758
warning: discount coeff 3 is out of range: 1.11643
warning: discount coeff 5 is out of range: 1.17202
warning: discount coeff 7 is out of range: 1.12503
+ ngram-count -read - -read-with-mincounts -order 3 -gt1 biglm.gt1
-gt2 biglm.gt2 -gt3 biglm.gt3 -lm gut.lm -no-eos -no-sos -prune 1e-8
-vocab ../gut.vocab -limit-vocab -meta-tag __meta__

It's really weird, because when I ran ngram-count on a single file
(with a command very similar to the one triggered by make-big-lm), the
eos and sos tokens were only included in the unigrams.
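
One way to confirm where the boundary tags end up is to scan the ARPA-format LM for N-grams of order 2 or higher that contain <s> or </s>. The following is a minimal sketch and not part of the original thread: the function name boundary_ngrams is hypothetical, and it assumes the standard ARPA layout (a "\N-grams:" header per section, each entry being a log probability, N words, and an optional backoff weight).

```python
BOUNDARY_TAGS = {"<s>", "</s>"}


def boundary_ngrams(lines, tags=BOUNDARY_TAGS):
    """Yield (order, words) for each N-gram of order >= 2 containing a boundary tag.

    `lines` is any iterable of text lines from an ARPA-format LM
    (hypothetical helper, assuming the standard ARPA layout).
    """
    order = 0  # 0 means "not inside an N-gram section"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section headers look like \2-grams: ; \data\ and \end\ reset the order.
            head = line[1:].split("-grams:")[0]
            is_section = line.endswith("-grams:") and head.isdigit()
            order = int(head) if is_section else 0
            continue
        if order < 2:
            continue  # unigrams are expected to list <s> and </s>
        fields = line.split()
        # fields[0] is the log10 probability, followed by `order` words
        # and an optional backoff weight.
        words = fields[1:1 + order]
        if tags.intersection(words):
            yield order, words


if __name__ == "__main__":
    import gzip
    import sys

    path = sys.argv[1]  # e.g. gut.lm, or a gzipped LM
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for n, words in boundary_ngrams(fh):
            print(n, " ".join(words))
```

Run against gut.lm, this would list entries such as "<PERIOD> </s>" if they are present, and print nothing if the tags occur only as unigrams.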

> Presently,  -no-sos -no-eos just affect the way ngrams are generated from text.   After counts are extracted, they don't affect any part of the LM building process.   It might make sense for these options to also modify the default vocab membership of <s> and </s>.  Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for?
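
Because the options act only while N-grams are generated from text, counts that were already extracted with sentence boundaries keep their <s> and </s> N-grams. The sketch below, which is not from the thread, shows one possible way to drop such entries from a counts file before the LM is built. It assumes each count line is a sequence of words followed by the count as the last whitespace-separated field; the name drop_boundary_counts is hypothetical. Note that removing counts changes the count-of-count statistics, so any existing gtcounts files would presumably need to be recomputed.

```python
BOUNDARY_TAGS = {"<s>", "</s>"}


def drop_boundary_counts(lines, tags=BOUNDARY_TAGS):
    """Yield count lines unchanged, skipping N-grams of order >= 2 with a boundary tag.

    Assumes each line is "w1 ... wn<whitespace>count" (an assumption about
    the counts format, labeled as such; hypothetical helper).
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            yield line  # blank or unexpected line: pass through untouched
            continue
        words = fields[:-1]  # the last field is taken to be the count
        if len(words) >= 2 and tags.intersection(words):
            continue  # higher-order N-gram involving <s> or </s>: drop it
        yield line  # unigrams (including the tags themselves) are kept


if __name__ == "__main__":
    import sys

    # Filter stdin to stdout, e.g. between gunzip and gzip in a pipeline.
    sys.stdout.writelines(drop_boundary_counts(sys.stdin))
```

Unigram counts for the tags are kept on purpose, which matches the behavior described above where <s> and </s> remain in the unigram section.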


It's ok if they are included as unigrams.

I am going to run some more tests, and if I find the problem I will
post it. For the moment I can work around this by making bigger
paragraphs, so that there are not so many eos and sos tags.

Thank you,

Alex

On Sat, Jul 28, 2012 at 7:46 PM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
>
> On 7/28/2012 3:09 AM, Alex Tomescu wrote:
>>
>> Hi
>>
>> I need to make a language model from a set of 5000+ texts. The texts
>> are separated into one sentence per line so there are a lot of
>> sentence boundary tokens which I need to get rid of.
>>
>> I used make-batch-counts and merge-batch-counts to count the ngrams,
>> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
>> still sentence boundaries were included.
>
> I don't see this behavior.  With make-big-lm -no-sos -no-eos  it's true that <s> and </s> appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are no higher-order N-grams involving <s> or </s> in the resulting LM.
>
> The same is true if you run ngram-count -no-sos -no-eos, so the two ways of building the LM are consistent in this regard.
>
> Presently,  -no-sos -no-eos just affect the way ngrams are generated from text.   After counts are extracted, they don't affect any part of the LM building process.   It might make sense for these options to also modify the default vocab membership of <s> and </s>.  Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for?
>
> Andreas
>
>
>>
>> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
>> the same results.
>>
>> Removing '\n' from the text files resulted in "line 1: line too long".
>>
>> I tried ngram-count with -no-eos -no-sos on one of the files and it
>> worked, but on a batch it didn't seem to work.
>>
>> Any ideas on what I should try next ?
>>
>> Thanks
>> --
>> Alexandru Tomescu, undergraduate Computer Science student at
>> Polytechnic University of Bucharest
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
>



--
Alexandru Tomescu, undergraduate Computer Science student at
Polytechnic University of Bucharest


