[SRILM User List] lattice-tool related issues

Andreas Stolcke stolcke at speech.sri.com
Tue Aug 3 16:29:55 PDT 2010


In message <5D1CA95A-F9E4-417E-B276-DE8056B3F254 at jhu.edu> you wrote:
> Hello,
> 
>   I am trying to rescore HTK lattices using lattice-tool and am
> running into the following issues:
> 
> 1. I pass a trigram language model and a vocabulary file to rescore the
> lattice (which encodes bigram information) and
> then write the updated and expanded lattice back in the HTK format.
> 
> However, when I specify the -unk and -keep-unk flags, the OOV words get
> mapped to <unk> without preserving the
> original label. I was under the impression that -keep-unk would
> preserve the label of the OOV word, but it does not do so.

I just looked at the code, and it seems that -keep-unk is only implemented
when reading HTK format lattices, not for PFSGs.  
Is that what you are using? 

If you are using HTK lattices then please prepare some small input data
files that demonstrate the problem, and I can look into it when I get a chance.

> 
> 2. Before I rescore the lattice, I want to split some words (multiword
> units). The multiwords are connected by an
> underscore character, so I use the flags -split-multiwords -multi-char _
> 
> All goes well as long as I do not use the -unk -keep-unk flags in
> conjunction with -split-multiwords. If I use -unk -keep-unk
> (for point 1 above) together with -split-multiwords, then the
> multiword splitting does not work; moreover, the OOV
> words get mapped to <unk>.
> 
> I should point out that the multiword unit is NOT in my vocabulary,
> but after the split, all the individual words are found
> in the vocabulary. Hence, I suspect that the -unk mapping
> takes place before the splitting,
> and since no multiword unit is in the vocabulary, -split-multiwords
> does not have
> anything to split.
> 
> I was wondering if there is any way we can invoke the -split-multiwords
> functionality before mapping
> unknown words to <unk>?

The way it works is that upon reading a lattice (before any operation
on it), word labels are converted to integers.  Normally a new word
generates a new integer automatically, but with -unk and -keep-unk
unknown words are mapped to the <unk> integer code.

Therefore, the splitting won't work if the multiwords themselves
are not in the vocabulary.
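
To make the ordering concrete, here is a toy Python model of the behavior
described above. It is only a sketch and is not SRILM code: ToyVocab,
read_label and read_and_split are hypothetical names, and the label
preservation that -keep-unk is meant to provide on output is not modeled.
It only shows that once an unseen multiword has been mapped to the <unk>
code at read time, there is no underscore left to split on.

```python
UNK = "<unk>"


class ToyVocab:
    """Hypothetical stand-in for a word-to-integer table (not SRILM's class)."""

    def __init__(self, words, use_unk):
        self.use_unk = use_unk      # models the effect of -unk at read time
        self.words = [UNK]          # integer code -> label
        self.index = {UNK: 0}       # label -> integer code
        for w in words:
            self._add(w)

    def _add(self, word):
        if word not in self.index:
            self.index[word] = len(self.words)
            self.words.append(word)
        return self.index[word]

    def read_label(self, label):
        """Convert a lattice word label to an integer code."""
        if label in self.index:
            return self.index[label]
        if self.use_unk:
            return self.index[UNK]  # unknown word collapses to the <unk> code
        return self._add(label)     # otherwise a new code is created


def read_and_split(labels, vocab_words, use_unk, multi_char="_"):
    """Read labels first, then split multiwords, in that order."""
    vocab = ToyVocab(vocab_words, use_unk)
    ids = [vocab.read_label(label) for label in labels]
    out = []
    for i in ids:
        # Splitting only sees the label stored for the integer code.
        out.extend(vocab.words[i].split(multi_char))
    return out


if __name__ == "__main__":
    lattice_labels = ["new_york", "city"]
    vocab_words = ["new", "york", "city"]
    print(read_and_split(lattice_labels, vocab_words, use_unk=True))
    print(read_and_split(lattice_labels, vocab_words, use_unk=False))
```

With use_unk=True the multiword comes back as <unk>; with use_unk=False
it is split into its component words.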

A workaround is to do the multiword splitting in a separate processing 
pass, where lattice-tool is invoked WITHOUT -unk.
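
A possible way to script the two passes is sketched below in Python. The
file names (in.lat, split.lat, rescored.lat, lm.3gm, vocab.txt) and the
helper names are placeholders made up for illustration, and the option set
is an assumption based on the flags mentioned in this thread plus the usual
lattice-tool input/output options; check it against your own command line
before relying on it.

```python
import subprocess


def split_pass(in_lat, out_lat, multi_char="_"):
    """Pass 1: split multiwords only, deliberately WITHOUT -unk."""
    return [
        "lattice-tool",
        "-read-htk", "-in-lattice", in_lat,
        "-split-multiwords", "-multi-char", multi_char,
        "-write-htk", "-out-lattice", out_lat,
    ]


def rescore_pass(in_lat, out_lat, lm, vocab, order=3):
    """Pass 2: rescore the already-split lattice, now with -unk -keep-unk."""
    return [
        "lattice-tool",
        "-read-htk", "-in-lattice", in_lat,
        "-vocab", vocab, "-unk", "-keep-unk",
        "-lm", lm, "-order", str(order),
        "-write-htk", "-out-lattice", out_lat,
    ]


def run_two_pass(in_lat, tmp_lat, out_lat, lm, vocab):
    """Run both passes in sequence; raises if either invocation fails."""
    for cmd in (split_pass(in_lat, tmp_lat),
                rescore_pass(tmp_lat, out_lat, lm, vocab)):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; only prints the commands, does not run them.
    print(" ".join(split_pass("in.lat", "split.lat")))
    print(" ".join(rescore_pass("split.lat", "rescored.lat",
                                "lm.3gm", "vocab.txt")))
```

The point of the split is that the first invocation never sees -unk, so
the multiword labels are still intact when -split-multiwords runs; the
second invocation then reads a lattice whose words are already the
individual components.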

Andreas 

> 
> I apologize if I am not understanding the lattice-tool well enough and  
> am passing wrong arguments in the first place.
> 
> Thanks and Regards
> -Anoop
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
