korean spam

David Relson relson at osagesoftware.com
Thu Oct 10 03:34:15 CEST 2002


Greetings,

FWIW, I got curious the other day about a couple of batches of Korean spam 
that I had from ???.co.kr and ???.hanmail.net.  I added a "-c" (check) 
option to lexertest that tells me whether each token is:

	normal - all characters in range 0x20...0x7f
	high   - all characters >= 0x80
	mixed  - both normal and high characters
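
To make the three groupings concrete, here is a minimal sketch in Python. It is not the lexertest code; the names classify_token and count_classes are hypothetical, and it assumes tokens arrive as raw bytes and never contain control characters below 0x20, so any byte under 0x80 is counted as normal.

```python
from collections import Counter


def classify_token(token: bytes) -> str:
    """Return 'normal', 'high', or 'mixed' for one token (hypothetical helper).

    Assumption: tokens hold no bytes below 0x20, so every byte under 0x80
    is treated as a normal character.
    """
    has_normal = any(b < 0x80 for b in token)
    has_high = any(b >= 0x80 for b in token)
    if has_normal and has_high:
        return "mixed"
    if has_high:
        return "high"
    return "normal"


def count_classes(tokens) -> Counter:
    """Tally how many tokens fall into each grouping."""
    return Counter(classify_token(t) for t in tokens)
```

Calling count_classes on the token stream of a mailbox would give the three totals printed at the end of a run.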

At the end, I have lexertest print out the counts for all three 
groupings.  Below is info on the two files.  First come the line, word, and 
character counts from wc.  Then come the word and message counts from 
lexertest.

                    wc -l   wc -w     wc -c   lexer-wrds    lexer-msgs     normal   high   mixed
co.kr.mbx :         23287   86113   1591863   64281 words   339 messages     4646   2737     145
hanmail.net.mbx :    6696   26344    539398   21135 words   104 messages     1185   1139      54

Certainly, my wordlists (both good and spam) have thousands of words that I 
can't read at all.  Offhand, I'd guess that most of those words qualify as 
"high", though some of them likely contain 1 or 2 normal characters.  I 
don't like having my lists filled with stuff that's totally junk and am 
wondering if bogofilter should do anything about this.  On the other hand, 
those tokens may not have any measurable effect on bogofilter.

Any thoughts from y'all???

David


For summary digest subscription: bogofilter-digest-subscribe at aotto.com


