problem with multiwords splitting and meshes

Andreas Stolcke stolcke at speech.sri.com
Thu Sep 20 12:22:42 PDT 2007


> 
> Dear Andreas,
> I am working with word-lattices containing multi-words.
> 
> I need to extract meshes from them,
> but I noticed a wrong behavior by just using
> the parameters "-split-multiwords"
> 
> This is due to the fact, I think, that the additional nodes are set
> with "wrong" timestamps (equal to the timestamp of the original end
> node), as I can see when saving in HTK format instead.

Yes, that is expected.  If you have no sub-word (phone) alignment information,
there is no way to assign time stamps to the components of a multiword.

> 
> This fact should be solved by version 1.5.3 by means of parameter
> "-multiword-dictionary".

That's what it was made for.

> Unfortunately I am not able to use it correctly.
> 
> I run the following command
> 
> cat example.lat | lattice-tool -htk-acscale 1 -htk-lmscale 14.766 \
>   -htk-wdpenalty -3 -in-lattice - -read-htk -out-lattice - -write-htk \
>   -split-multiwords -multiword-dictionary multiword.lexicon
> 
> and I got the following error message
> 
> Lattice::splitHTKMultiwordNodes: no pronunciation on multiword node we_will
> 
> I attached a very small (artificial) lattice "example.lat" and a real  
> lattice "example2.lat".
> 
> The file multiword.lexicon contains lines like the following
> we_will w iy | w el
> 
> 
> So I would ask you if you can please help me.
> 
> Specifically, I have some specific questions
> 
> - Is the format of the file with the multiword lexicon correct

Yes.

> - Do I need also the lexicon dictionary? Something like the following?
> we w iy
> will w el

No.

> - Do I miss anything else?

Yes.  Look at the error message: "no pronunciation on multiword node".
If you have no pronunciation information in the original lattice you cannot
infer the alignment of the split multiword.

The pronunciation and phone alignment format for HTK lattices may not be
well documented.   It consists of a string of phone labels and durations 
separated by commas and colons.  In your case, the node for we_will would 
need to look like this:

J=1 S=0 E=1 W=we_will v=3 a=-200 l=-4 d=:w,0.1:iy,0.2:w,0.1:el,0.2:

AND the phone string needs to correspond exactly to an entry in your
multiword dictionary with boundary marker (as it does in this case).
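
To make the format concrete, here is a minimal Python sketch (not part of
SRILM) that builds and parses the value of a "d=" field and checks it against
a multiword dictionary entry.  It assumes each phone is written as
"phone,duration", that entries are delimited by colons, and that "matching"
means the phone sequences are identical once the "|" boundary markers are
ignored.  The function names are made up for illustration.

```python
def format_phone_alignment(phones_with_durations):
    """Build a d= value such as ':w,0.1:iy,0.2:' from (phone, duration) pairs."""
    body = ":".join(f"{phone},{dur:g}" for phone, dur in phones_with_durations)
    return f":{body}:"


def parse_phone_alignment(field):
    """Parse a d= value such as ':w,0.1:iy,0.2:' into (phone, duration) pairs."""
    pairs = []
    for item in field.strip(":").split(":"):
        phone, dur = item.split(",")
        pairs.append((phone, float(dur)))
    return pairs


def matches_dictionary_entry(field, entry_phones):
    """Compare the phones in a d= value with a dictionary pronunciation.

    Assumption: the '|' boundary markers in the dictionary entry are
    ignored for the comparison; everything else must match exactly.
    """
    lattice_phones = [phone for phone, _ in parse_phone_alignment(field)]
    dict_phones = [p for p in entry_phones.split() if p != "|"]
    return lattice_phones == dict_phones


alignment = format_phone_alignment(
    [("w", 0.1), ("iy", 0.2), ("w", 0.1), ("el", 0.2)]
)
print(alignment)
print(matches_dictionary_entry(alignment, "w iy | w el"))
```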

I have no idea how you would get your decoder to output this information.
You might be able to "fake it" by 
(1) looking up the pronunciation variant (3 in this case) in your decoding
dictionary, and (2) making assumptions about the relative durations of the
phones  (you can get the total word duration from the lattice node times).
You would then have to insert properly formatted "d=" fields into the 
lattices before sending the lattice to lattice-tool.
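
A rough sketch of such a preprocessing step in Python follows.  It is only an
illustration under strong assumptions: every phone gets the same share of the
word duration, the phones for the pronunciation variant have already been
looked up, and the link line has no "d=" field yet.  The helper names are
hypothetical, not SRILM functions.

```python
def uniform_alignment(phones, total_duration):
    """Give every phone an equal share of the word duration (an assumption)."""
    per_phone = total_duration / len(phones)
    return ":" + ":".join(f"{p},{per_phone:.3f}" for p in phones) + ":"


def add_alignment_to_link(link_line, phones, start_time, end_time):
    """Append a faked d= field to an HTK link line that lacks one."""
    if any(field.startswith("d=") for field in link_line.split()):
        return link_line  # alignment already present, leave it alone
    alignment = uniform_alignment(phones, end_time - start_time)
    return f"{link_line.rstrip()} d={alignment}"


link = "J=1 S=0 E=1 W=we_will v=3 a=-200 l=-4"
# Phones taken from the decoding dictionary for this pronunciation variant;
# the start and end times come from the two lattice nodes of the link.
print(add_alignment_to_link(link, ["w", "iy", "w", "el"], 0.0, 0.6))
```

Because the durations are rounded to three decimals, their sum can differ
slightly from the node-time difference; whether that matters for your
lattices is something you would have to check.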

> - What happens to the scores of the edge corresponding to the multiword?

All the scores are retained on the first multiword component, the remaining
components get 0 scores (so the total score along the path is unchanged).
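
In Python terms, the score assignment amounts to something like the following
sketch.  It only mirrors the behavior described in this message and is not
the actual lattice-tool code; the function name is invented.

```python
def split_scores(components, acoustic, lm):
    """First component keeps all scores, the remaining ones get 0."""
    return [
        (word, acoustic if i == 0 else 0.0, lm if i == 0 else 0.0)
        for i, word in enumerate(components)
    ]


parts = split_scores(["we", "will"], acoustic=-200.0, lm=-4.0)
print(parts)
# The path totals are the same as for the original multiword.
print(sum(a for _, a, _ in parts), sum(l for _, _, l in parts))
```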

> In other words, how can I generate a new lattice with multiwords  
> split over several edges,
> containing "correct" scores and times,  somehow proportional to the  
> "length" of each component word?

If you want to split multiword nodes using a different strategy from what
is described above you can implement it yourself, either as a preprocessing
step or by modifying the function Lattice::splitHTKMultiwordNodes() in
lattice/src/HTKLattice.cc .
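
As one possible starting point for a preprocessing step, here is a Python
sketch that divides times and scores in proportion to the number of phones in
each component.  This is a hypothetical strategy, not what lattice-tool does.
It assumes that the component words are joined with "_" and that the
dictionary entry marks component boundaries with "|".

```python
def split_pronunciation(entry_phones):
    """Split 'w iy | w el' into [['w', 'iy'], ['w', 'el']]."""
    parts, current = [], []
    for token in entry_phones.split():
        if token == "|":
            parts.append(current)
            current = []
        else:
            current.append(token)
    parts.append(current)
    return parts


def proportional_split(word, entry_phones, start, end, acoustic, lm):
    """Share times and scores by phone count (a hypothetical strategy)."""
    components = word.split("_")
    parts = split_pronunciation(entry_phones)
    if len(components) != len(parts):
        raise ValueError("components and pronunciation parts do not match")
    total_phones = sum(len(p) for p in parts)
    result, t = [], start
    for comp, phones in zip(components, parts):
        frac = len(phones) / total_phones
        seg_end = t + frac * (end - start)
        result.append({
            "word": comp,
            "start": t,
            "end": seg_end,
            "acoustic": acoustic * frac,
            "lm": lm * frac,
        })
        t = seg_end
    return result


for edge in proportional_split("we_will", "w iy | w el", 0.0, 0.6, -200.0, -4.0):
    print(edge)
```

Splitting the language model score this way is purely a bookkeeping choice;
only the sum along the path is meaningful.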

Andreas



