[SRILM-Announce] SRILM 1.7.0 released
Wen Wang
wwang at speech.sri.com
Mon Dec 24 16:16:14 PST 2012
Hi,
The latest version, 1.7.0, of SRILM is now available from
http://www.speech.sri.com/projects/srilm/download.html .
A list of changes appears below.
Enjoy and Merry Christmas!
Wen
1.7.0 23 December 2012
Functionality:
* ngram -codebook option for reading of Ngram LMs with
quantized parameters (contributed by Microsoft).
* ngram -msweb-lm option for obtaining LM probabilities from
the Microsoft Web N-gram service (web-ngram.research.microsoft.com). You need
to obtain a user ID to use this service; see man ngram for details
(contributed by Microsoft).
* Added support for dictionary-induced word distance metrics to
nbest-optimize (-dictionary option).
* Added support for matrix-defined word distance metrics to
nbest-optimize (-distances option).
* ngram -debug 4 -ppl outputs ranking statistics (number of
times correct word was in top 1, 5, 10), as well as quadratic and absolute
loss averages (based on code from Omid Madani).
* nbest-optimize accepts n-best list in SRInterp format and
generates SRInterp format rover-control file (weights file), when
-srinterp-format is specified.
* nbest-optimize accepts SRInterp counts file that contains
BLEU and TER counts info.
* lattice-tool -read-mesh will try to preserve acoustic information
(times, scores, pronunciations) if they are encoded in the
input confusion network.
* Support reading of text files in UTF-8 and UTF-16 encodings.
All string data is internally represented, and output, as ASCII/UTF-8
(contributed by Microsoft).
This feature uses the iconv library. Support for this feature
can be disabled by compiling with "NO_ICONV=anything" on the make
command line.
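As an illustration of the ranking statistics reported by ngram -debug 4 -ppl, here is a minimal Python sketch that counts how often the correct word falls within the top 1, 5, and 10 candidates. It is not SRILM code: the function names, the input layout (one word-to-probability mapping per predicted position), and the tie handling are all assumptions, and the loss averages are left out.

```python
from typing import Dict, Iterable, Mapping, Tuple


def rank_of_correct(dist: Mapping[str, float], correct: str) -> int:
    """Return the 1-based rank of `correct` under descending probability."""
    p = dist.get(correct, 0.0)
    # Assumption: ties are resolved in favor of the correct word, so only
    # strictly more probable words push its rank down.
    return 1 + sum(1 for q in dist.values() if q > p)


def ranking_stats(
    events: Iterable[Tuple[Mapping[str, float], str]],
    cutoffs: Tuple[int, ...] = (1, 5, 10),
) -> Dict[int, int]:
    """Count, for each cutoff k, how often the correct word ranked in the top k."""
    counts = {k: 0 for k in cutoffs}
    for dist, correct in events:
        rank = rank_of_correct(dist, correct)
        for k in cutoffs:
            if rank <= k:
                counts[k] += 1
    return counts
```

Each event pairs a hypothetical predictive distribution with the word that actually occurred; the returned dictionary maps each cutoff to its hit count.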
Portability:
* Ported LM client/server code to Winsock API (native socket
library in Windows), enabling this functionality for mingw and MSVC platforms
(contributed by Microsoft).
* Let machine-type script return 64-bit platform names for Linux
and Solaris x86 when appropriate. This implies that 64-bit binaries are
built by default on machines that support them.
* Array.h tweak for clang compiler (from kutlak.roman at gmail.com).
* Work around a namespace problem in C++11 (from
kutlak.roman at gmail.com).
* Use size_t for hash codes to ensure word width matches
pointer type.
* Fixes for mingw32 build, using Windows APIs for sockets and UTF
conversion (contributed by Microsoft).
* Support for 64-bit mingw build (MACHINE_TYPE=win64).
* Updates for MacOSX (MACHINE_TYPE=macosx, thanks to Chuck
Wooters).
* Deal with nonportability of isfinite() and isnan().
* Changes for thread-safety (by Kyle McIntyre). See
doc/README-THREADS for details.
- Modified the remove() methods in various container classes to
return a Boolean instead of a pointer to the removed element. The
removed element can be retrieved through an optional reference argument. This
eliminates the need for a global static variable.
- Use STL sort() instead of qsort() in LHash and SArray sorted
iterations.
- Replaced all static variables with thread-local storage via
the TLSWrapper class, requiring the pthread library. This is available on most
platforms, but can be disabled at compile-time with -DNO_TLS.
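The idea behind replacing static variables with thread-local storage can be sketched in Python with threading.local. This is only an analogy, not the TLSWrapper class, and every name below is hypothetical: a scratch buffer that would otherwise be one shared static object becomes one private object per thread.

```python
import threading
from typing import Dict, List, Sequence, Tuple

# Each thread sees its own set of attributes on this object.
_tls = threading.local()


def scratch_buffer() -> List[Tuple[str, int]]:
    """Return the calling thread's private scratch list, creating it on first use."""
    buf = getattr(_tls, "buffer", None)
    if buf is None:
        buf = []
        _tls.buffer = buf
    return buf


def worker(tag: str, n: int, results: Dict[str, List[Tuple[str, int]]]) -> None:
    """Fill this thread's buffer and publish a copy under the thread's tag."""
    buf = scratch_buffer()
    for i in range(n):
        buf.append((tag, i))
    results[tag] = list(buf)


def run_workers(tags: Sequence[str], n: int) -> Dict[str, List[Tuple[str, int]]]:
    """Run one worker thread per tag and collect what each saw in its buffer."""
    results: Dict[str, List[Tuple[str, int]]] = {}
    threads = [threading.Thread(target=worker, args=(t, n, results)) for t in tags]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a single shared buffer the workers' entries would be interleaved; with per-thread storage each worker sees only its own.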
Bug fixes:
* NgramLM backoff computation fixed to avoid spurious insertion
of nonzero unigram probabilities and non-unity backoff weights (resulting from
numerator/denominator values below Prob_Epsilon).
* lattice-tool does a better job inferring the lattice basename
from the UTTERANCE string embedded in HTK lattices.
* Trellis class: use a secondary sorting criterion to make
N-best output deterministic.
* WordMesh class: use posterior word probability to decide
which acoustic information to keep when merging hyps, instead of
duration-normalized acoustic scores as before. This leads to fewer words with
out-of-order timestamps when extracting one-best from confusion networks.
* fix-ctm script: Check for out-of-order word timestamps and
adjust them minimally as needed to produce a monotonic sequence, as
required for CTM sorting.
* Fixed bug in NgramCountLM estimation procedure reported by
ariya at jhu.edu.
* Allow ngram -hidden-vocab to read hidden event properties
as described in the man page.
* Fixed bug in ngram -hidden-vocab -write-lm output.
* Avoid crash when ngram -hidden-not -ppl is used with debug
level 2.
* Fixed (very rare) bug by which ngram -prune might remove all
ngrams sharing a common context.
* Improved ngram -prune-lowprobs by also removing backoff
weights that have become useless (suggested by Arlo Faria).
* Check for successful search for HTK lattice start/end nodes,
if not explicitly specified (reported by nshmyrev at yandex.ru).
* Handle infinity scores in lattice rescoring, and catch NaN
scores when reading HTK lattices.
* make-kn-discounts checks for negative discount values and reports
an error if appropriate.
* nbest-optimize accepts combined BLEU and error rate objective
via switch -error-bleu-ratio R (R specifies the error rate weight).
* lattice-tool -timeout option now uses sigsetjmp/siglongjmp to
handle timeout alarms. This is necessary in Linux-compatible
(including cygwin) systems to handle alarms repeatedly.
* Fixed a bug reading NBestList2.0 format without phone
information (led to malformed confusion network output).
* Fixed a bug in Ngram::contextID() that was causing incorrect
expansion of lattices with pruned backoff models.
* Fixed a bug in the lattice-tool -keep-unk implementation that was
sometimes allowing an OOV word label to be output as <unk>.
* Removed some pseudo-randomness in ngram-class so that results
are more invariant to OPTION setting and platform properties.
* Avoid differences due to machine arithmetic in word mesh
alignment, making confusion network building and posterior decoding more
stable across platforms.
* Exclude metatags when writing out the vocabulary of binary
Ngram LMs.
* Fixed some missing dependencies in Visual Studio solution file.
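The Trellis fix above relies on a general technique: when hypotheses can tie on score, a secondary sort key makes the N-best order independent of the order in which hypotheses were generated. A minimal Python sketch follows; the secondary key used here (the word sequence) is an assumption, since these notes do not say which criterion SRILM uses, and the names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A hypothesis is a (score, word sequence) pair; higher scores are better.
Hyp = Tuple[float, Tuple[str, ...]]


def nbest(hyps: Sequence[Hyp], n: int) -> List[Hyp]:
    """Return the n best hypotheses in a fully determined order."""
    # Primary key: descending score. Secondary key: the word sequence, which
    # breaks score ties so the result no longer depends on input order.
    return sorted(hyps, key=lambda h: (-h[0], h[1]))[:n]
```

Sorting on score alone would leave tied hypotheses in arrival order, which can differ between runs or platforms.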