strange symbols
Andreas Stolcke
stolcke at speech.sri.com
Tue Jul 29 23:27:13 PDT 2008
They look like ASCII control characters (character values < 0x20).
You need to do a better job filtering your training data.
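
A minimal sketch of such a filter in Python, assuming the training text
is plain text with one sentence per line. The caret forms in the listing
below (^A, ^C, ...) are caret notation for the characters 0x01, 0x03, and
so on. The function name strip_control_chars and the choice to replace each
control character with a space rather than delete it are assumptions made
for illustration; neither is part of SRILM.

```python
import re

# ASCII control characters below 0x20, excluding tab (0x09), newline (0x0A)
# and carriage return (0x0D), which normally act as whitespace.
# Including DEL (0x7F) is an assumption; drop it from the class if unwanted.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(line: str) -> str:
    """Replace control characters with a space and collapse whitespace."""
    cleaned = CONTROL_CHARS.sub(" ", line)
    # split() with no arguments drops leading, trailing and repeated whitespace
    return " ".join(cleaned.split())


def filter_lines(lines):
    """Yield cleaned lines, skipping any that become empty."""
    for line in lines:
        cleaned = strip_control_chars(line)
        if cleaned:
            yield cleaned
```

Passing every line of the crawled corpus through filter_lines before
handing the text to ngram-count should keep these tokens out of the
vocabulary.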
--Andreas
In message <79a042480807291803w44eb15c8ic8d4c4ef4a0e8182 at mail.gmail.com> you wrote:
>
> Dear all,
> I'm using SRILM on some data crawled from the Web. The LM contains some
> strange symbols such as these:
> \1-grams:
> -6.774207 ^A 0
> -6.774207 ^C
> -6.774207 ^D
> -6.774207 ^E 0
> -6.774207 ^F 0
> -6.774207 ^G 0
> -6.774207 ^H 0
> -6.774207 ^K 0
> -6.774207 ^N 0
> -6.774207 ^O
> -6.774207 ^P
> -6.774207 ^T 0
> -6.774207 ^X
> -6.774207 ^Y 0
> -6.774207 ^\
> -6.774207 ^]
> -6.774207 ^^ 0
> -6.774207 ^_
>
> These symbols are not simply a ^ followed by a letter; each seems to be
> a single, different character, perhaps one that has been truncated or
> something similar.
> Do you have any idea what they are and how to remove them?
>
> Thanks a lot,
> Marco
>