korean spam
David Relson
relson at osagesoftware.com
Thu Oct 10 03:34:15 CEST 2002
Greetings,
FWIW, I got curious the other day about a couple of batches of Korean spam
that I had from ???.co.kr and ???.hanmail.net. I added a "-c" (check)
option to lexertest that tells me whether each token is:

    normal - all characters in range 0x20...0x7f
    high   - all characters >= 0x80
    mixed  - both normal and high characters

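For anyone who wants to try the same classification outside of lexertest, here is a minimal sketch in Python. It is not the lexertest code (that is C); the names classify_token and count_classes are made up for illustration. It assumes tokens arrive as raw bytes, and it treats every byte below 0x80 as "normal", which is slightly looser than the 0x20...0x7f range given above.

```python
from collections import Counter


def classify_token(token: bytes) -> str:
    """Return 'normal', 'high', or 'mixed' for one token.

    Assumption: any byte below 0x80 counts as normal, any byte
    at or above 0x80 counts as high. An empty token is 'normal'.
    """
    has_high = any(b >= 0x80 for b in token)
    has_normal = any(b < 0x80 for b in token)
    if has_high and has_normal:
        return "mixed"
    if has_high:
        return "high"
    return "normal"


def count_classes(tokens) -> Counter:
    """Tally how many tokens fall into each of the three groups."""
    return Counter(classify_token(t) for t in tokens)


if __name__ == "__main__":
    # Hypothetical sample tokens: plain ASCII, all high bytes, and a mix.
    sample = [b"hello", b"\xc7\xd1\xb1\xdb", b"abc\xc7\xd1"]
    counts = count_classes(sample)
    print(counts["normal"], counts["high"], counts["mixed"])
```

Working on bytes rather than decoded strings keeps the check independent of whatever charset the spam claims to use.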
At the end, I have lexertest print out the counts for all three
groupings. Below is info on the two files. First come the line, word, and
character counts from wc. Then come the word and message counts from
lexertest, followed by the normal, high, and mixed token counts.

                  wc -l   wc -w     wc -c   lexer-wrds    lexer-msgs     normal   high   mixed
co.kr.mbx       : 23287   86113   1591863   64281 words   339 messages     4646   2737     145
hanmail.net.mbx :  6696   26344    539398   21135 words   104 messages     1185   1139      54

Certainly, my wordlists (both good and spam) have thousands of words that I
can't read at all. Offhand, I'd guess that most of those words qualify as
"high", though some of them likely contain 1 or 2 normal characters. I
don't like having my lists filled with stuff that's total junk and am
wondering if bogofilter should do anything about this. On the other hand,
these tokens may not have any measurable effect on bogofilter.
Any thoughts from y'all???
David