[SRILM User List] lattice-tool related issues
Anoop Deoras
adeoras at jhu.edu
Tue Aug 3 18:02:03 PDT 2010
On Aug 3, 2010, at 7:29 PM, Andreas Stolcke wrote:
>
> In message <5D1CA95A-F9E4-417E-B276-DE8056B3F254 at jhu.edu> you wrote:
>> Hello,
>>
>> I am trying to rescore HTK lattices using lattice-tool and am
>> running into the following issues:
>>
>> 1. I pass a trigram language model and a vocabulary file to rescore the
>> lattice (which encodes bigram information) and then write the updated,
>> expanded lattice back out in HTK format.
>>
>> However, when I specify the -unk and -keep-unk flags, the OOV words get
>> mapped to <unk> without preserving the original label. I was under the
>> impression that -keep-unk would preserve the label of the OOV word, but
>> it does not do so.
>
> I just looked at the code, and it seems that -keep-unk is only
> implemented when reading HTK format lattices, not for PFSGs.
> Is that what you are using?
>
> If you are using HTK lattices then please prepare some small input data
> files that demonstrate the problem, and I can look into it when I get a
> chance.
>
Hi Andreas,
I am, in fact, using HTK lattices. I was doing some debugging myself and
noticed that when the rescoring LM is of the same order as the lattice
(i.e. when no lattice expansion is required), -keep-unk works fine. When I
use a higher-order LM, it fails. I have uploaded the data at:
<URL>
Please run RescoreLattice.sh to process the HTK lattice file. I have
included the necessary vocabulary, trigram LM, and bigram LM files too.
(Note: the input lattices encode bigram history, so a trigram rescoring LM
expands the lattice.) The word 'slash' is out of vocabulary. Bigram
rescoring keeps it intact, while trigram rescoring maps it to <unk>.
>>
>> 2. Before I rescore the lattice, I want to split some words (multiword
>> units). The multiwords are joined by an underscore character, so I use
>> the flags -split-multiwords -multi-char _
>>
>> All goes well as long as I do not use the -unk -keep-unk flags in
>> conjunction with -split-multiwords. If I use -unk -keep-unk (for
>> point 1 above) together with -split-multiwords, then the multiword
>> splitting does not work and, moreover, the OOV words get mapped to <unk>.
>>
>> I should point out that the multiword unit is NOT in my vocabulary,
>> but after the split all the individual words are found in the
>> vocabulary. Hence, I suspect that the -unk processing takes place
>> before the splitting, and since no multiword unit is in the vocabulary,
>> -split-multiwords has nothing to split.
>>
>> I was wondering if there is any way to invoke the -split-multiwords
>> functionality before mapping unknown words to <unk>?
>
> The way it works is that upon reading the lattice (before any operation
> on it), word labels are converted to integers. Normally a new word
> generates a new integer automatically, but with -unk and -keep-unk
> unknown words are mapped to the <unk> integer code.
>
> Therefore, the splitting won't work if the multiwords themselves
> are not in the vocabulary.
>
> A workaround is to do the multiword splitting in a separate processing
> pass, where lattice-tool is invoked WITHOUT -unk.
>
> Andreas
Yes, that makes sense. Thank you.
-Anoop
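
The label-to-integer mapping Andreas describes for point 2, and why the
two-pass workaround helps, can be illustrated with a small toy model. This is
only a sketch and not SRILM's actual code: the function names (build_index,
read_labels, split_multiwords), the three-word vocabulary, and the choice of 0
as the <unk> code are all assumptions made for the illustration.

```python
UNK = "<unk>"

def build_index(vocab):
    """Give each vocabulary word an integer code; 0 is reserved for <unk> (assumed)."""
    index = {UNK: 0}
    for word in vocab:
        index.setdefault(word, len(index))
    return index

def read_labels(labels, index, map_unknown):
    """Convert word labels to integer codes, as happens when a lattice is read.

    map_unknown=False: an unseen label is added to the index with a fresh code.
    map_unknown=True:  an unseen label collapses to the <unk> code.
    """
    codes = []
    for label in labels:
        if label not in index:
            if map_unknown:
                codes.append(index[UNK])
                continue
            index[label] = len(index)
        codes.append(index[label])
    return codes

def split_multiwords(codes, index, sep="_"):
    """Split multiword labels on sep, starting from the stored integer codes."""
    words = {code: word for word, code in index.items()}
    out = []
    for code in codes:
        out.extend(words[code].split(sep))
    return out

vocab = ["new", "york", "city"]      # hypothetical vocabulary
labels = ["new_york", "city"]        # hypothetical lattice word labels

# Single pass with unknown-word mapping on: the multiword is already <unk>
# by the time splitting runs, so there is nothing left to split.
index = build_index(vocab)
single_pass = split_multiwords(read_labels(labels, index, True), index)

# Two passes: split first without unknown-word mapping, then read the
# resulting labels again with mapping on for rescoring.
index = build_index(vocab)
two_pass = split_multiwords(read_labels(labels, index, False), index)
rescoring_codes = read_labels(two_pass, index, True)

print(single_pass)      # ['<unk>', 'city']
print(two_pass)         # ['new', 'york', 'city']
print(rescoring_codes)  # [1, 2, 3]
```

In lattice-tool terms, the first pass of the two-pass variant corresponds to
running with -split-multiwords -multi-char _ and without -unk, and the second
pass to rescoring the split lattice with -unk -keep-unk.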