korean spam

David Relson relson at osagesoftware.com
Thu Oct 10 05:45:04 CEST 2002


At 09:53 PM 10/9/02, Graham Wilson wrote:
>On Wed, Oct 09, 2002 at 09:34:15PM -0400, David Relson wrote:
> > FWIW, I got curious the other day about a couple of bunches of korean
> > spam that I had from ???.co.kr and ???.hanmail.net.  I added a "-c"
>
>i have been getting chinese (gb2312) and japanese (iso-2022-jp)
>spam.
>
> > Certainly, my wordlists (both good and spam) have thousands of words
> > that I can't read at all.  Offhand, I'd guess that most of those words
> > qualify as "high", though some of them likely contain 1 or 2 normal
> > characters.  I don't like having my lists be filled with stuff that's
> > totally junk and am wondering if bogofilter should do anything about
> > this.  On the other hand, they may not have any measurable effect on
> > bogofilter.
>
>i figured it was good to have those tokens in the database because i
>figured they were words in chinese or japanese that would give the
>message away as spam. this doesn't really seem to be happening though.
>
>it might be more useful if the lexer had the ability to produce tokens
>from messages written in eastern languages that could be used like we
>use the tokens in english (and other western language) messages. like i
>said, i don't feel like bogofilter is doing a good job at that.
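
One way a lexer could produce usable tokens from such text, purely as an
illustration and not something bogofilter is known to do: decode the bytes
with the declared charset and emit overlapping two-character tokens for each
run of non-ASCII characters. The function name bigram_tokens and the choice
of bigrams are assumptions made for this sketch.

```python
import re

# Runs of characters outside the 7-bit ASCII range (after decoding).
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")

def bigram_tokens(raw, charset):
    """Hypothetical helper: return overlapping character bigrams for each
    non-ASCII run in `raw`, decoded with `charset` (e.g. "gb2312")."""
    # Undecodable bytes become U+FFFD rather than raising an error.
    text = raw.decode(charset, errors="replace")
    tokens = []
    for run in NON_ASCII_RUN.findall(text):
        if len(run) == 1:
            # A lone character has no neighbour; keep it as its own token.
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

Bigrams avoid the need for a dictionary-based word segmenter, at the cost of
producing more tokens per message.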

Those tokens may well be recognizable as words.  Certainly, seeing those 
byte sequences is a sure sign of spam for _me_.  My thoughts run along the 
lines of "Do I have to clutter up my word lists with that stuff?" and "If I 
have two or more sequences of that stuff, I can just say it's spam".
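
A minimal sketch of that "two or more sequences" rule of thumb, assuming a
"sequence" means a run of at least two consecutive bytes with the high bit
set. The names count_high_runs and looks_like_high_byte_spam, the run length
of two, and the threshold are all assumptions for illustration, not
bogofilter code.

```python
import re

# A "high" run: two or more consecutive bytes in the range 0x80-0xFF.
# The minimum run length of 2 is an assumption of this sketch.
HIGH_RUN = re.compile(rb"[\x80-\xff]{2,}")

def count_high_runs(raw):
    """Count runs of consecutive high-bit bytes in a raw message body."""
    return len(HIGH_RUN.findall(raw))

def looks_like_high_byte_spam(raw, threshold=2):
    """Flag the message when it has at least `threshold` high-bit runs."""
    return count_high_runs(raw) >= threshold
```

This only sees 8-bit encodings such as gb2312 or euc-kr. iso-2022-jp uses
escape sequences and 7-bit bytes, so it would need a separate check.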

Haven't had any of those arrive since I started running bogofilter, so I 
can't say for sure what'll happen with them.


For summary digest subscription: bogofilter-digest-subscribe at aotto.com


