From yhifny at yahoo.com Thu Jul 6 20:31:43 2017
From: yhifny at yahoo.com (yasser hifny)
Date: Fri, 7 Jul 2017 03:31:43 +0000 (UTC)
Subject: [SRILM User List] how null nodes are created?
References: <1336121674.275695.1499398303650.ref@mail.yahoo.com>
Message-ID: <1336121674.275695.1499398303650@mail.yahoo.com>

Hi,

I have this LM, and its PFSG is attached as a PDF. Can you please explain in detail how the null nodes are created?

\data\
ngram 1=6
ngram 2=7
ngram 3=1

\1-grams:
-0.5228788	</s>
-99	<s>	-0.7781513
-0.69897	cat	-0.30103
-1	dog	-0.30103
-1	set	-0.30103
-0.5228788	the	-0.4771213

\2-grams:
-0.05387538	<s> the	-0.1439066
-0.39794	cat </s>
-0.5228788	cat set
-0.1870866	dog </s>
-0.1870866	set </s>
-0.2466723	the cat
-0.69897	the dog

\3-grams:
-0.1618508	<s> the cat

\end\

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.pdf
Type: application/pdf
Size: 18036 bytes
Desc: not available
URL: 

From lfu20 at hotmail.com Fri Jul 7 00:24:10 2017
From: lfu20 at hotmail.com (Luis Uebel)
Date: Fri, 7 Jul 2017 07:24:10 +0000
Subject: [SRILM User List] Job Positions Deep Learning, Natural Language Processing - Samsung Institute
Message-ID: 

You can make a difference by bringing real applications to millions! Would you like to work in a global organization that is the biggest mobile hardware manufacturer in the world?

These are full-time positions at the Samsung Institute in Manaus (Brazil). Speaking Portuguese is not required. Interested candidates can send their resume to the following address: sidia at samsung.com. Please mention “Job Position – MIL” in the subject.

Position 1 - Technical Specialist - HP1903

A degree in Software Engineering, Computer Science, Electrical Engineering, Statistics, Mathematics, Physics, or a related field is required; a PhD or Master's degree is a plus.
Description

An innovative professional with extensive experience in software development for mobile devices and in advanced computer science technologies such as artificial intelligence (machine learning, NLP), cloud computing, the Internet of Things, and augmented reality applications. Solid know-how and the skills to advise technical teams on the best technological solutions for a broad range of problems related to software, computers, and mobile devices.

Main Activities

The Technical Specialist will propose innovative projects and design the software architecture and product specification; support the development team during execution and validation; advise the innovation team on the latest technology trends; and evaluate products, services, technical skills, and company solutions in terms of technical feasibility and cost/benefit, comparing different technical solutions and the best providers for them.

Position 2 – Senior Software Engineer – Advanced Machine Learning / NLP

We are looking for a senior software engineer with experience in advanced machine learning.

Minimum Requirements:
• Minimum 4 years of professional experience in software development, with at least 2 years using machine learning techniques;
• Proven experience with one or more advanced machine learning techniques:
  o Deep networks (CNN, RNN, DBN, DBM, LSTM, …);
  o Deep reinforcement learning for data, text, audio, speech and/or multimedia.
• Proficiency in a neural network library (e.g., TensorFlow, Theano, Torch, MXNet, …);
• Proficiency in Python, C/C++, Java and/or Scala for machine learning and NLP (desirable);
• Familiarity with NLP tools (e.g., OpenNLP, NLTK, …);
• Experience applying machine learning techniques to very large datasets/corpora;
• Background in machine learning techniques (generative models, discriminative models, neural networks, Bayesian methods, expectation maximization, HMMs, logistic regression, random forests, …);
• A Master's degree or PhD from an accredited university in Computer Science, Electrical Engineering, Statistics, Mathematics, Physics, or a related field is required;
• Proficiency in Python, C/C++, Java and/or Scala;
• Excellent knowledge of the computational environment required for data processing, mining, and machine learning;
• Knowledge of the requirements of large-scale systems operating in a cloud environment and of all the architectural aspects involved.

Desired Requirements:
• Proven track record of research/publications in the machine learning and artificial intelligence fields;
• Experience implementing machine learning algorithms in constrained computing environments (CPU, memory, and battery usage);
• Proficiency in R, Octave, Matlab, Perl, Bash and/or Ruby;
• Patent tracking, and open-source code and license analysis;
• Experience with software documentation in English;
• Background in CUDA and/or CPU parallel programming.

Position 3 – Software Engineer – Machine Learning / NLP

We are looking for a software engineer with academic knowledge and basic experience with machine learning and/or NLP applications.

Minimum Requirements:
• Basic experience (~1 year) in software development using machine learning and/or natural language processing techniques;
• Background in deploying software in a cloud environment and in collecting and analyzing user data;
• Knowledge of machine learning techniques:
  o Deep networks (CNN, RNN, DBN, DBM, LSTM, …);
  o Deep reinforcement learning.
• Knowledge of a neural network library (e.g., TensorFlow, Theano, Torch, MXNet, …) and/or NLP tools (e.g., OpenNLP, NLTK, …);
• Proficiency in Python, C/C++, Java and/or Scala;
• Background with very large datasets/corpora;
• Background in machine learning techniques (generative models, discriminative models, neural networks, Bayesian methods, expectation maximization, HMMs, logistic regression, random forests, …).

Desired Requirements:
• Research/publications in the machine learning, artificial intelligence and/or natural language processing fields;
• Background in CUDA and/or CPU parallel programming;
• Software development for constrained computing environments (CPU, memory and/or battery usage);
• Proficiency in R, Octave, Matlab, Perl, Bash and/or Ruby;
• Patent tracking, and open-source code and license analysis;
• Experience with software documentation in English.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From garyhess999 at hotmail.com Tue Aug 1 08:37:07 2017
From: garyhess999 at hotmail.com (Gary Hess)
Date: Tue, 1 Aug 2017 15:37:07 +0000
Subject: [SRILM User List] Problem installing SRILM
Message-ID: 

Hi -- I wonder if anyone can help with my installation problem. I decompressed the file "srilm-1.7.2.tar.gz" in the directory /home/gary/workspace/srilm. In the Makefile, I set "SRILM = /home/gary/workspace/srilm".

Machine info:

unknown:~/workspace/srilm> ./sbin/machine-type
i686-m64
unknown:~/workspace/srilm> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

Now I switch to the tcsh shell.

unknown:~/workspace/srilm> make World
make: execvp: /home/gary/workspace/srilm: Permission denied
make: *** /home/gary/workspace/srilm: Is a directory.  Stop.

There is a problem with the host name ("unknown"). Is that the real problem? I don't think so, but...

Otherwise, I am getting "make: execvp: /home/gary/workspace/srilm: Permission denied". It looks like a permission problem (which I googled quite a bit). I also tried it with "sudo" but got the same error.

Here is the directory listing:

unknown:~/workspace/srilm> ls -ld
drwxr-xr-x 15 gary gary 4096 Aug  1 15:28 .
unknown:~/workspace/srilm> ls -l
total 64340
-rw-r--r--  1 gary gary     4638 Jun 24  2015 ACKNOWLEDGEMENTS
-rw-r--r--  1 gary gary    92666 Nov  9  2016 CHANGES
drwxr-xr-x  2 gary gary    12288 May 25  2016 common
-rw-r--r--  1 gary gary      984 May 25  2016 Copyright
drwxr-xr-x  2 gary gary     4096 Apr  3  2013 doc
drwxr-xr-x  5 gary gary     4096 May 24  2013 dstruct
drwxr-xr-x  6 gary gary     4096 Feb  9  2011 flm
-rwxr-xr-x  1 gary gary      587 May  3  2016 go.build-android
-rwxr-xr-x  1 gary gary      588 May 10  2016 go.build-android-hard
-rwxr-xr-x  1 gary gary      581 May  4  2016 go.build-android-v8
-rw-r--r--  1 gary gary     6909 Apr  5  2014 INSTALL
drwxr-xr-x  6 gary gary     4096 Feb  9  2011 lattice
drwxr-xr-x  2 gary gary     4096 May 24  2013 lib
-rw-r--r--  1 gary gary    12528 Jun 24  2015 License
drwxr-xr-x  6 gary gary     4096 Feb  9  2011 lm
-rw-r--r--  1 gary gary     5063 Aug  1 15:26 Makefile
drwxr-xr-x 12 gary gary     4096 Oct 13  2015 man
drwxr-xr-x  5 gary gary     4096 Feb  9  2011 misc
-rw-r--r--  1 gary gary      587 Dec  2  2009 README
-rw-r--r--  1 gary gary        6 Nov  9  2016 RELEASE
drwxr-xr-x  2 gary gary     4096 Jul  3  2015 sbin
-rw-r-----  1 gary gary 65659816 Jul 31 10:44 srilm-1.7.2.tar.gz
drwxr-xr-x  6 gary gary     4096 Apr  4  2013 utils
drwxr-xr-x  3 gary gary     4096 Feb  9  2011 visual_studio
drwxr-xr-x  5 gary gary     4096 Aug  2  2013 zlib

Thanks in advance for any help.

Gary Hess

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nshmyrev at yandex.ru Tue Aug 1 14:45:49 2017
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Wed, 02 Aug 2017 00:45:49 +0300
Subject: [SRILM User List] Problem installing SRILM
In-Reply-To: 
References: 
Message-ID: <180521501623949@web34g.yandex.ru>

01.08.2017, 18:43, "Gary Hess" :
> Hi -- I wonder if anyone can help with my installation problem. I decompressed the file "srilm-1.7.2.tar.gz" in the directory /home/gary/workspace/srilm. In the Makefile, I set "SRILM = /home/gary/workspace/srilm"

You probably did something bad when you modified the makefile on top of that change.
> Now I switch to the tcsh shell.

There is no need to change the shell.

> There is a problem with the host name ("unknown"). Is that the real problem? I don't think so, but...

It is not a problem.

> Otherwise, I am getting "make: execvp: /home/gary/workspace/srilm: Permission denied". It looks like a permission problem (which I googled quite a bit). I also tried it with "sudo" but got the same error.

This error is most likely caused by your Makefile modification. Try again from a clean state and be more careful. If you still get the error, share the Makefile with your edits. Also share the build.log, which you can create with the command

make -d World > build.log 2>&1

From xulikui123321 at 163.com Wed Aug 2 00:41:21 2017
From: xulikui123321 at 163.com (徐)
Date: Wed, 2 Aug 2017 15:41:21 +0800 (CST)
Subject: [SRILM User List] Generate new model with existing model and text
Message-ID: <3de486a8.951e.15da1e45479.Coremail.xulikui123321@163.com>

Hi,

I trained an LM; then my boss gave me some new text and told me to strengthen the probability of the n-grams in these texts. What I have been doing is generating counts from the new text, merging them with the old n-gram counts, and then retraining the model. Is there a command or method to do this faster?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From garyhess999 at hotmail.com Wed Aug 2 00:40:58 2017
From: garyhess999 at hotmail.com (Gary Hess)
Date: Wed, 2 Aug 2017 07:40:58 +0000
Subject: [SRILM User List] Problem installing SRILM
In-Reply-To: <180521501623949@web34g.yandex.ru>
References: , <180521501623949@web34g.yandex.ru>
Message-ID: 

Hi Nickolay,

I ran "make SRILM=$cwd World" and it compiled (a suggestion from Andreas Stolcke).

Best regards,
Gary

________________________________
From: Nickolay V. Shmyrev
Sent: Tuesday, August 1, 2017 11:45 PM
To: Gary Hess; srilm-user at speech.sri.com
Subject: Re: [SRILM User List] Problem installing SRILM

01.08.2017, 18:43, "Gary Hess" :
> Hi -- I wonder if anyone can help with my installation problem. I decompressed the file "srilm-1.7.2.tar.gz" in the directory /home/gary/workspace/srilm. In the Makefile, I set "SRILM = /home/gary/workspace/srilm"

You probably did something bad when you modified the makefile on top of that change.

> Now I switch to the tcsh shell.

There is no need to change the shell.

> There is a problem with the host name ("unknown"). Is that the real problem? I don't think so, but...

It is not a problem.

> Otherwise, I am getting "make: execvp: /home/gary/workspace/srilm: Permission denied". It looks like a permission problem (which I googled quite a bit). I also tried it with "sudo" but got the same error.

This error is most likely caused by your Makefile modification. Try again from a clean state and be more careful. If you still get the error, share the Makefile with your edits. Also share the build.log, which you can create with the command

make -d World > build.log 2>&1

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at icsi.berkeley.edu Wed Aug 9 13:55:51 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 9 Aug 2017 13:55:51 -0700
Subject: [SRILM User List] Generate new model with existing model and text
In-Reply-To: <3de486a8.951e.15da1e45479.Coremail.xulikui123321@163.com>
References: <3de486a8.951e.15da1e45479.Coremail.xulikui123321@163.com>
Message-ID: <7d49df9d-b0af-abde-1cc3-5edeb0ffd8e2@icsi.berkeley.edu>

On 8/2/2017 12:41 AM, 徐 wrote:
> Hi,
> I trained an LM; then my boss gave me some new text and told me to
> strengthen the probability of the n-grams in these texts. What I have
> been doing is generating counts from the new text, merging them with
> the old n-gram counts, and then retraining the model. Is there a
> command or method to do this faster?

Combining the counts of your main training data with those from the adaptation data is one approach. There is no shortcut for this: you have to actually combine the counts (which you can do by just cat'ing the two count files together), then train a new model.

The other approach is to train a separate model on the adaptation data, then interpolate that model with the base model. This is usually more convenient because (1) you process the training data for the base model only once, and (2) you can control the influence of the adaptation data by changing the weights of the models in the interpolation.

To interpolate two ngram models, use

ngram -order N -lm BASEMODEL -mix-lm NEWMODEL -lambda WEIGHT -write-lm ADAPTEDMODEL

WEIGHT is the weight of BASEMODEL, typically something close to 1, like 0.9, assuming the adaptation data is small compared to the main training corpus.

For a comparison of the two LM adaptation approaches and more background, see http://www.sciencedirect.com/science/article/pii/S0167639303001055 .

Make sure you are not adapting on the test data that you use to get a realistic performance estimate. Otherwise your result will be overly optimistic and your boss will be disappointed later ;-)

Andreas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From macnet2008 at gmail.com Thu Aug 10 02:29:23 2017
From: macnet2008 at gmail.com (Mac Neth)
Date: Thu, 10 Aug 2017 09:29:23 +0000
Subject: [SRILM User List] SRILM ngram-count speed
Message-ID: 

Hello,

I am building an LM from a corpus text file of around 8 MB using the SRILM "ngram-count" command, and it takes around 1 minute 30 seconds to build the language model file. Each time I add a line or two to the corpus, I have to rebuild the LM file.

I am using the command as follows:

ngram-count -text corpus.txt -order 3 -lm model.lm

I have been able to improve performance using the binary output option:

ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm

and the LM file is now produced in around 1 minute. Is there any further optimization to speed up the LM building?

Thanks in advance,

Mac

From wen.wang at sri.com Thu Aug 10 02:47:22 2017
From: wen.wang at sri.com (Wen Wang)
Date: Thu, 10 Aug 2017 02:47:22 -0700
Subject: [SRILM User List] SRILM ngram-count speed
In-Reply-To: 
References: 
Message-ID: <1e3dac6b-e923-303f-95b2-340df29e48fa@sri.com>

Mac,

It seems that you need to update the corpus incrementally and frequently. If that is the case, you don't have to recompute all n-gram counts every time.
To save time and speed up training, you could first save the n-gram counts from the current corpus:

ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz

Then collect n-gram counts for just the additional text (denoted add.txt here) that you are going to append to corpus.txt:

ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz

Then merge the two sets of n-gram counts:

ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz

Now you can build the LM just by loading the updated counts, new.3grams.gz, instead of the updated text:

ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin

Thanks,

Wen

On 8/10/17 2:29 AM, Mac Neth wrote:
> Hello,
>
> I am building an LM from a corpus text file of around 8 MB using the
> SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
> to build the language model file.
>
> Each time I add a line or two to the corpus, I have to rebuild the LM file.
>
> I am using the command as follows:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm
>
> I have been able to improve performance using the binary output option:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>
> and the LM file is now produced in around 1 minute.
>
> Is there any further optimization to speed up the LM building?
>
> Thanks in advance,
>
> Mac
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user

From macnet2008 at gmail.com Thu Aug 10 11:45:35 2017
From: macnet2008 at gmail.com (Mac Neth)
Date: Thu, 10 Aug 2017 18:45:35 +0000
Subject: [SRILM User List] SRILM ngram-count speed
Message-ID: 

Hi Wen,

Thanks for that. I have tried your steps.
But it seems the last step takes roughly the same time as before, around 55 seconds:

1) a few seconds:
ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz

2) a few seconds:
ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz

3) a few seconds:
ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz

4) around 55 seconds:
ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin

I have added the option "-lm" to your command. Should I drop it? Your command was:

ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin

Thanks,

Mac

2017-08-10 9:47 GMT+00:00 Wen Wang :
> Mac,
>
> It seems that you need to update the corpus incrementally and frequently.
> If that is the case, you don't have to recompute all n-gram counts every
> time. To save time and speed up training, you could first save the n-gram
> counts from the current corpus:
>
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> Then collect n-gram counts for just the additional text (denoted add.txt
> here) that you are going to append to corpus.txt:
>
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> Then merge the two sets of n-gram counts:
>
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> Now you can build the LM just by loading the updated counts,
> new.3grams.gz, instead of the updated text:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Wen

From stolcke at icsi.berkeley.edu Thu Aug 10 14:05:12 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 10 Aug 2017 14:05:12 -0700
Subject: [SRILM User List] SRILM ngram-count speed
In-Reply-To: 
References: 
Message-ID: <136901b0-6cf5-cb29-d7de-e3a506d98693@icsi.berkeley.edu>

The time spent in ngram-count is made up of two components:

- the time to count the ngrams
- the time to estimate the LM

Right now your training corpus is small, so the first portion is small compared to the second, and saving effort on that portion will not save you much overall. However, if your base model were trained on a substantial corpus, the savings would be a larger portion of the overall time.

Unfortunately, you cannot do the LM estimation incrementally, because various aspects of it (e.g., computing the smoothing parameters) depend on having the entire count distribution.

However, you could use an approach where you don't train an entirely new model including the added data, and instead just interpolate the base model with a small model trained only on the new data. (I just responded to a different post on the list describing this approach.) The resulting model would be suboptimal, but the speedup might be worth it, depending on your application.

Andreas

On 8/10/2017 11:45 AM, Mac Neth wrote:
> Hi Wen,
>
> Thanks for that. I have tried your steps.
> But it seems the last step
> takes roughly the same time as before, around 55 seconds:
>
> 1) a few seconds:
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> 2) a few seconds:
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> 3) a few seconds:
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> 4) around 55 seconds:
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin
>
> I have added the option "-lm" to your command. Should I drop it? Your
> command was:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Mac

From wen.wang at sri.com Thu Aug 10 23:38:42 2017
From: wen.wang at sri.com (Wen Wang)
Date: Thu, 10 Aug 2017 23:38:42 -0700
Subject: [SRILM User List] SRILM ngram-count speed
In-Reply-To: 
References: 
Message-ID: 

Mac,

Please check out Andreas' suggestions on other ways to speed up your LM training. My suggestion mostly applies when (1) you need to do this kind of incremental corpus update frequently, or (2) the original corpus.txt file is already quite large.

Sorry, that was a typo: you should have -write-binary-lm -lm model.bin.

Thanks,

Wen

On 8/10/17 11:45 AM, Mac Neth wrote:
> Hi Wen,
>
> Thanks for that. I have tried your steps. But it seems the last step
> takes roughly the same time as before, around 55 seconds:
>
> 1) a few seconds:
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> 2) a few seconds:
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> 3) a few seconds:
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> 4) around 55 seconds:
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin
>
> I have added the option "-lm" to your command. Should I drop it? Your
> command was:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Mac

From tibazaki.alhabba at wmich.edu Mon Aug 14 10:27:54 2017
From: tibazaki.alhabba at wmich.edu (Tiba Zaki Abdulhameed Abdulhameed)
Date: Mon, 14 Aug 2017 17:27:54 +0000
Subject: [SRILM User List] Interpolation output
In-Reply-To: 
References: , 
Message-ID: 

Hi,

Hope this email finds you well. I need help building an interpolated LM. I have read the message below:

http://www.speech.sri.com/pipermail/srilm-user/2013q4/001583.html

My command line is:

ngram -unk -bayes 0 -lm LM1Class_Based.lm -mix-lm LM2tri.lm -classes output.classes -write-lm LM3Interpolated.lm -ppl data/test

I get the error:

write() method not implemented
error writing LM3Interpolated.lm

I can't remove -bayes because I have read that to mix different types of LM the mixture needs to be dynamic. So what is the best way to do this? I would really appreciate your advice. The documentation does not clearly address this issue.

Best Regards,
Tiba

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tibazaki.alhabba at wmich.edu Mon Aug 14 10:35:14 2017
From: tibazaki.alhabba at wmich.edu (Tiba Zaki Abdulhameed Abdulhameed)
Date: Mon, 14 Aug 2017 17:35:14 +0000
Subject: [SRILM User List] Interpolated LM
Message-ID: 

Hi,

Hope this email finds you well. I need help building an interpolated LM. My command line is:

ngram -unk -bayes 0 -lm LM1Class_Based.lm -mix-lm LM2tri.lm -classes output.classes -write-lm LM3Interpolated.lm -ppl data/test

I get the error:

write() method not implemented
error writing LM3Interpolated.lm

I can't remove -bayes because I have read that to mix different types of LM the mixture needs to be dynamic. So what is the best way to do this? I would really appreciate your advice.
the documentations are not clearly identify this issue Best Regards Tiba -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Mon Aug 14 11:12:21 2017 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 14 Aug 2017 11:12:21 -0700 Subject: [SRILM User List] Interpolated LM In-Reply-To: References: Message-ID: <68cfafd7-093d-5b8e-51f1-cd7c8c461346@icsi.berkeley.edu> The error message says there is no way to write the interpolated model to a single model file. But wherever you would apply (evaluate) the interpolated LM with SRILM, you can just use the same options -bayes 0 -lm LM1Class_Based.lm -mix-lm LM2tri.lm -classes output.classes to reconstruct the interpolated model. If your goal is to "convert" the interpolated LM into a standard word-based ngram model, you can try to first expand the class-based LM into a word-based LM: ngram -lm LM1Class_Based.lm -classes output.classes -expand-classes K -write-lm LM1Class-EXPANDED.lm K is the maximal length of ngrams that are allowed to result from expanding the class tokens. This is an approximation, and can blow up the model size combinatorially, so check the perplexity after expansion. If it looks reasonable you can then interpolate the expanded LM with your other LM without using the -bayes option. Andreas On 8/14/2017 10:35 AM, Tiba Zaki Abdulhameed Abdulhameed wrote: > Hi, > Hope this email finds you well. > Please I need help. I need to do interpolated LM. > my command line > > ngram -unk -bayes 0 -lm LM1Class_Based.lm -mix-lm LM2tri.lm > -classes output.classes -write-lm LM3Interpolated.lm -ppl data/test > > I have error > > write() method not implemented > error writing LM3Interpolated.lm > I can't remove -bayes because I have read that to mix different type > of LM we need it to be dynamic > So what is the best way to do so? please I really appreciate your > advice. 
> The documentation does not clearly address this issue.
>
> Best Regards
> Tiba
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user

From shreyas1696 at gmail.com Sat Aug 19 01:56:19 2017
From: shreyas1696 at gmail.com (Shreya Singh)
Date: Sat, 19 Aug 2017 14:26:19 +0530
Subject: [SRILM User List] exact command for combining more than two language models in srilm
Message-ID:

Hi,

I would like to know whether there is a command for combining more than two language models in SRILM. I know that for two LMs the command is:

    ngram -order N -lm LM1 -mix-lm LM2 -lambda W -write-lm MIXLM

where N is the maximum ngram order in the two LMs, LM1 and LM2 are the input models, W is the weight given to LM1, and MIXLM is the merged model file.

What should I use for more than two LMs?

Regards,
Shreya

From stolcke at icsi.berkeley.edu Sat Aug 19 10:08:08 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 19 Aug 2017 10:08:08 -0700
Subject: [SRILM User List] exact command for combining more than two language models in srilm
Message-ID: <7c73df97-b10a-61d0-50b1-565bc82645a8@icsi.berkeley.edu>

On 8/19/2017 1:56 AM, Shreya Singh wrote:
> Hi,
> I would like to know whether there is a command for combining more
> than two language models in SRILM. I know that for two LMs the command is:
>
> ngram -order N -lm LM1 -mix-lm LM2 -lambda W -write-lm MIXLM
>
> where N is the maximum ngram order in the two LMs, LM1 and LM2 are the
> input models, W is the weight given to LM1, and MIXLM is the merged
> model file.
>
> What should I use for more than two LMs?
    ngram -order N \
        -lm LM0 -lambda W0 \
        -mix-lm LM1 \
        -mix-lm2 LM2 -mix-lambda2 W2 \
        -mix-lm3 LM3 -mix-lambda3 W3 \
        ... \
        -mix-lm9 LM9 -mix-lambda9 W9 \
        -write-lm MIXLM

As you can see, there is no option for the weight of LM1, since that is implicitly given by 1 minus the sum of the other weights.

Because this syntax is a little inconsistent and limited to 10 models, there is also a more general mechanism, which reads the model specification from a file. Here is the relevant section from the ngram(1) man page:

> -read-mix-lms
>     Read a list of linearly interpolated (mixture) LMs and their
>     weights from the file specified with -lm, instead of gathering
>     this information from the command line options above. Each line
>     in file starts with the filename containing the component LM,
>     followed by zero or more component-specific options:
>
>     -weight W    the prior weight given to the component LM
>
>     -order N     the maximal ngram order to use
>
>     -type T      the LM type, one of ARPA (the default), COUNTLM,
>                  MAXENT, LMCLIENT, or MSWEBLM
>
>     -classes C   the word class definitions for the component LM
>                  (which must be of type ARPA)
>
>     -cache-served-ngrams
>                  enables client-side caching for LMs of type
>                  LMCLIENT or MSWEBLM.
>
>     The global options -bayes, -bayes-scale, and -context-priors
>     still apply with -read-mix-lms. When -bayes is NOT used, the
>     interpolation is static by ngram merging, and forces all
>     component LMs to be of type ARPA or MAXENT.

Note that the file from which the model specification is read must be given with the -lm option.

Andreas
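For intuition, the linear interpolation these options perform can be sketched in a few lines of Python. This is an illustrative toy, not SRILM code: the dict-based "models" and the mix_prob helper are invented stand-ins, with one weight left as None to be inferred as 1 minus the sum of the others, the way ngram infers the weight of the -mix-lm model.

```python
# Toy sketch of linear LM interpolation (not SRILM code).
# Each "model" is a dict mapping (history, word) -> probability.

def mix_prob(models, weights, history, word):
    """P(word|history) = sum_i w_i * P_i(word|history).

    Exactly one weight may be None; it is inferred as 1 minus the
    sum of the explicit weights (like the implicit -mix-lm weight).
    """
    explicit = [w for w in weights if w is not None]
    implicit = 1.0 - sum(explicit)
    ws = [implicit if w is None else w for w in weights]
    # Unseen (history, word) pairs get probability 0 in this toy;
    # a real LM would back off instead.
    return sum(w * m.get((history, word), 0.0) for w, m in zip(ws, models))

# Two hypothetical component models assigning P(cat | the):
lm0 = {(("the",), "cat"): 0.5}
lm1 = {(("the",), "cat"): 0.1}
p = mix_prob([lm0, lm1], [0.75, None], ("the",), "cat")
# 0.75 * 0.5 + 0.25 * 0.1 = 0.4
```

The same weighted-sum applies per word position whether the mixture is merged statically into one ARPA file or evaluated dynamically with -bayes.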
From kalpeshk2011 at gmail.com Mon Sep 4 16:24:22 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Tue, 5 Sep 2017 04:54:22 +0530
Subject: [SRILM User List] Unigram Cache Model
Message-ID:

Hi everyone,

I'm trying to implement the KN5+cache model mentioned in Table 4.1 of Mikolov's PhD thesis, http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf. By using the command "./ngram -lm LM -ppl ptb.test.txt -unk -order 5 -cache 192 -cache-lambda 0.1" I managed to achieve a ppl value of 126.74 (I tuned `cache` and `cache-lambda`). What additional steps are needed to exactly reproduce the result? (125.7)

I generated my LM using "./ngram-count -lm LM -unk -kndiscount -order 5 -text ptb.train.txt -interpolate -gt3min 1 -gt4min 1 -gt5min 1".

Thank you,
Kalpesh

From stolcke at icsi.berkeley.edu Tue Sep 5 09:12:38 2017
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 5 Sep 2017 09:12:38 -0700
Subject: [SRILM User List] Unigram Cache Model
Message-ID: <2fe69599-4a3b-e1f3-40b9-e848accc05d4@icsi.berkeley.edu>

On 9/4/2017 4:24 PM, Kalpesh Krishna wrote:
> Hi everyone,
> I'm trying to implement the KN5+cache model mentioned in Table 4.1 of
> Mikolov's PhD thesis, http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf.
> By using the command "./ngram -lm LM -ppl ptb.test.txt -unk -order 5
> -cache 192 -cache-lambda 0.1" I managed to achieve a ppl value of
> 126.74 (I tuned `cache` and `cache-lambda`). What additional steps are
> needed to exactly reproduce the result? (125.7)
> I generated my LM using "./ngram-count -lm LM -unk -kndiscount -order
> 5 -text ptb.train.txt -interpolate -gt3min 1 -gt4min 1 -gt5min 1".

First off, does the ppl obtained with just the KN ngram model match?

About the cache LM, Tomas writes

> We also report the perplexity of the best n-gram model (KN5) when
> using unigram cache model (as implemented in the SRILM toolkit).
> We have used several unigram cache models interpolated together, with
> different lengths of the cache history (this works like a crude
> approximation of cache decay, i.e. words further in the history have
> lower weight).

So he didn't just use a single cache LM as implemented by the ngram -cache option. He must have used multiple versions of this model (with different parameter values), saved out the word-level probabilities, and interpolated them off-line.

You can run an individual cache LM and save out the probabilities using

    ngram -vocab VOCAB -null -cache 192 -cache-lambda 1 -ppl TEST -debug 2 > TEST.ppl

Repeat this several times with different -cache parameters, and also for the KN ngram.

Then use compute-best-mix on all the output files to determine the best mixture weights (of course you need to do this using a held-out set, not the actual test set).

Then you do the same for the test set, but use

    compute-best-mix lambda='....' precision=1000 ppl-file ppl-file ...

where you provide the weights from the held-out set to the lambda= parameter. (The precision parameter is such that it won't iterate.) This will give you the test-set perplexity.

Of course you still might have trouble getting the exact same results, since Tomas didn't disclose the exact parameter values he used. But since you're already within 1 perplexity point of his results, I would question whether this matters.

Andreas

From kalpeshk2011 at gmail.com Tue Sep 5 11:18:39 2017
From: kalpeshk2011 at gmail.com (Kalpesh Krishna)
Date: Tue, 5 Sep 2017 23:48:39 +0530
Subject: [SRILM User List] Unigram Cache Model
In-Reply-To: <2fe69599-4a3b-e1f3-40b9-e848accc05d4@icsi.berkeley.edu>
References: <2fe69599-4a3b-e1f3-40b9-e848accc05d4@icsi.berkeley.edu>
Message-ID:

Hi Andreas,

> First off, does the ppl obtained with just the KN ngram model match?
Yes, I could exactly reproduce the 3-gram and 5-gram KN ppl numbers. I had to use the -interpolate and -gtXmin 1 flags to replicate the results, though.

> Of course you still might have trouble getting the exact same results
> since Tomas didn't disclose the exact parameter values he used.

Thanks a lot for the method! I was suspecting something was missing.

Best Regards,
Kalpesh

On Tue, Sep 5, 2017 at 9:42 PM, Andreas Stolcke wrote:

> On 9/4/2017 4:24 PM, Kalpesh Krishna wrote:
>
> Hi everyone,
> I'm trying to implement the KN5+cache model mentioned in Table 4.1 of
> Mikolov's PhD thesis, http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf.
> By using the command "./ngram -lm LM -ppl ptb.test.txt -unk -order 5
> -cache 192 -cache-lambda 0.1" I managed to achieve a ppl value of 126.74
> (I tuned `cache` and `cache-lambda`). What additional steps are needed
> to exactly reproduce the result? (125.7)
> I generated my LM using "./ngram-count -lm LM -unk -kndiscount -order 5
> -text ptb.train.txt -interpolate -gt3min 1 -gt4min 1 -gt5min 1".
>
> First off, does the ppl obtained with just the KN ngram model match?
>
> About the cache LM, Tomas writes
>
> We also report the perplexity of the best n-gram model (KN5) when using
> unigram cache model (as implemented in the SRILM toolkit). We have used
> several unigram cache models interpolated together, with different
> lengths of the cache history (this works like a crude approximation of
> cache decay, i.e. words further in the history have lower weight).
>
> So he didn't just use a single cache LM as implemented by the ngram
> -cache option. He must have used multiple versions of this model (with
> different parameter values), saved out the word-level probabilities, and
> interpolated them off-line.
>
> You can run an individual cache LM and save out the probabilities using
>
>     ngram -vocab VOCAB -null -cache 192 -cache-lambda 1 -ppl TEST -debug 2 > TEST.ppl
>
> Repeat this several times with different -cache parameters, and also for
> the KN ngram.
>
> Then use compute-best-mix on all the output files to determine the best
> mixture weights (of course you need to do this using a held-out set, not
> the actual test set).
>
> Then you do the same for the test set, but use
>
>     compute-best-mix lambda='....' precision=1000 ppl-file ppl-file ...
>
> where you provide the weights from the held-out set to the lambda=
> parameter. (The precision parameter is such that it won't iterate.) This
> will give you the test-set perplexity.
>
> Of course you still might have trouble getting the exact same results,
> since Tomas didn't disclose the exact parameter values he used. But since
> you're already within 1 perplexity point of his results, I would question
> whether this matters.
>
> Andreas

--
Kalpesh Krishna,
Junior Undergraduate,
Electrical Engineering,
IIT Bombay
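The recipe in this thread (several unigram caches with different history lengths, per-word probabilities saved out, interpolation weights fitted by compute-best-mix) can be sketched in plain Python. This is an illustrative toy, not SRILM code: UnigramCache and best_mix are invented names, and the plain probability lists stand in for the per-word probabilities you would parse out of the -debug 2 ppl files.

```python
# Toy sketch of a unigram cache LM and EM weight fitting (not SRILM code).
from collections import Counter, deque

class UnigramCache:
    """P_cache(w) = relative frequency of w among the most recent
    `length` words, roughly what ngram -cache LENGTH maintains."""
    def __init__(self, length):
        self.window = deque(maxlen=length)
        self.counts = Counter()

    def prob(self, word):
        return self.counts[word] / len(self.window) if self.window else 0.0

    def update(self, word):
        # Decrement the count of the word about to fall out of the window.
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1
        self.window.append(word)
        self.counts[word] += 1

def best_mix(prob_streams, iters=50):
    """EM for mixture weights, in the spirit of compute-best-mix:
    maximize sum_t log sum_i w_i * p_i(t) over word positions t,
    where prob_streams[i][t] is model i's probability of word t."""
    k, n = len(prob_streams), len(prob_streams[0])
    w = [1.0 / k] * k
    for _ in range(iters):
        post = [0.0] * k
        for t in range(n):
            mix = sum(w[i] * prob_streams[i][t] for i in range(k))
            for i in range(k):
                post[i] += w[i] * prob_streams[i][t] / mix
        w = [p / n for p in post]  # re-estimate: average posterior per model
    return w
```

On real data you would run caches of several lengths over the held-out text, collect each model's per-word probabilities (plus the static KN model's), fit the weights with best_mix, and then freeze those weights when scoring the test set, mirroring the lambda='....' precision=1000 step above.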