[SRILM User List] lattice-tool related issues
Andreas Stolcke
stolcke at speech.sri.com
Fri Jan 28 21:49:48 PST 2011
Sorry for not responding earlier to this.
The latest version has a new lattice-tool option: -zeroprob-word. It
allows you to avoid assigning zero probabilities to OOV words without
mapping them to <unk>.
Andreas
Anoop Deoras wrote:
>
> On Aug 3, 2010, at 7:29 PM, Andreas Stolcke wrote:
>
>>
>> In message <5D1CA95A-F9E4-417E-B276-DE8056B3F254 at jhu.edu> you wrote:
>>> Hello,
>>>
>>> I am trying to rescore HTK lattices using lattice-tool and am
>>> running into the following issues:
>>>
>>> 1. I pass a trigram language model and a vocabulary file to rescore the
>>> lattice (which encodes bigram information) and
>>> then write the updated and expanded lattice back out in HTK
>>> format.
>>>
>>> However, when I specify the -unk and -keep-unk flags, the OOV words get
>>> mapped to <unk> without preserving the
>>> original label. I was under the impression that -keep-unk would
>>> preserve the label of the OOV word, but it does not do so.
>>
>> I just looked at the code, and it seems that -keep-unk is only
>> implemented
>> when reading HTK format lattices, not for PFSGs.
>> Is that what you are using?
>>
>> If you are using HTK lattices then please prepare some small input data
>> files that demonstrate the problem, and I can look into it when I get
>> a chance.
>>
>
> Hi Andreas,
>
> I am, in fact, using HTK lattices. I was doing some debugging myself
> and noticed that when the rescoring LM is of the same order as the
> lattice (i.e., if lattice expansion is not required), -keep-unk works
> fine. When I use a higher-order LM, it fails. I have uploaded the
> data at:
>
> <URL>
>
> Please run RescoreLattice.sh to process the HTK lattice file. I have
> included the necessary vocabulary and the trigram and bigram LM files
> too. (Note: the input lattices encode bigram history, and hence a
> trigram rescoring LM expands the lattice.)
>
> The word 'slash' is out of vocabulary. Bigram rescoring keeps it intact,
> while trigram rescoring maps it to <unk>.
>
>
>>>
>>> 2. Before I rescore the lattice, I want to split some words (multiword
>>> units). The multiwords are joined by an
>>> underscore character. I therefore use the flags -split-multiwords
>>> -multi-char _
>>>
>>> All goes well as long as I do not use the -unk -keep-unk flags in
>>> conjunction with -split-multiwords. If I use the -unk -keep-unk flags
>>> (for point 1 above) together with -split-multiwords, then the
>>> multiword splitting does not work; moreover, the OOV
>>> words get mapped to <unk>.
>>>
>>> I should point out that the multiword unit is NOT in my vocabulary,
>>> but after the split, all the individual words are found
>>> in the vocabulary. Hence, I suspect that the -unk mapping
>>> takes place before the splitting,
>>> and since no multiword unit is in the vocabulary, -split-multiwords
>>> does not have
>>> anything to split.
>>>
>>> I was wondering if there is any way to invoke the -split-multiwords
>>> functionality before mapping
>>> unknown words?
>>
>> The way it works is that upon reading a lattice (before any operation
>> on it), word labels are converted to integers. Normally a new word
>> generates a new integer automatically, but with -unk and -keep-unk
>> unknown words are mapped to the <unk> integer code.
>>
>> Therefore, the splitting won't work if the multiwords themselves
>> are not in the vocabulary.
>>
>> A workaround is to do the multiword splitting in a separate processing
>> pass, where lattice-tool is invoked WITHOUT -unk.
>>
>> Andreas
>
> Yes, that makes sense. Thank you.
>
> -Anoop
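
The read-time mapping described in the thread can be illustrated with a
small toy model. This is only a sketch in Python, not SRILM code: the
Vocab class, the split_multiwords helper, and the example words are all
hypothetical, and the label retention that -keep-unk is meant to provide
is not modeled. It shows why splitting finds nothing to split once a
multiword has already been mapped to <unk>, and why splitting in a
separate pass without a closed vocabulary avoids the problem.

```python
UNK = "<unk>"


class Vocab:
    """Toy word-to-integer map (hypothetical; not SRILM's Vocab class)."""

    def __init__(self, words, closed):
        self.index = {UNK: 0}
        self.words = [UNK]
        self.closed = closed  # True loosely mimics reading with -unk
        for w in words:
            self.add(w)

    def add(self, word):
        # Assign the next free integer to a word not seen before.
        if word not in self.index:
            self.index[word] = len(self.words)
            self.words.append(word)
        return self.index[word]

    def read(self, word):
        # Map a lattice label to an integer at read time.
        if word in self.index:
            return self.index[word]
        return self.index[UNK] if self.closed else self.add(word)


def split_multiwords(ids, vocab, sep="_"):
    # Splitting sees only the labels that survived the read step.
    out = []
    for i in ids:
        for part in vocab.words[i].split(sep):
            out.append(vocab.read(part))
    return out


def labels(ids, vocab):
    return [vocab.words[i] for i in ids]


lm_words = ["new", "york", "city"]   # assumed LM vocabulary
lattice = ["new_york", "city"]       # assumed lattice labels

# One pass, closed vocabulary: "new_york" is already <unk> before splitting.
closed = Vocab(lm_words, closed=True)
one_pass = labels(
    split_multiwords([closed.read(w) for w in lattice], closed), closed
)

# Two passes: split with an open vocabulary, then re-read with a closed one.
open_vocab = Vocab(lm_words, closed=False)
split = labels(
    split_multiwords([open_vocab.read(w) for w in lattice], open_vocab),
    open_vocab,
)
closed_again = Vocab(lm_words, closed=True)
two_pass = labels([closed_again.read(w) for w in split], closed_again)

print(one_pass)
print(two_pass)
```

In the one-pass case the multiword comes out as <unk>, whereas the
two-pass case recovers the individual in-vocabulary words, which matches
the workaround of running the split in a separate pass without -unk.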