FAQ: Asian spam
Clint Adams
schizo at debian.org
Thu Mar 27 04:48:37 CET 2003
> Do any of you know whether or not bogofilter's processing is correct for
> the Chinese, Japanese, or Korean charsets? Or is the parsing totally
> bogus, but sufficiently repeatable to produce usable spam/ham
> classifications?
It's definitely suboptimal. Attached is the body of a spam I received
in the GB2312 charset. (The body of this message should be UTF-8.)
In these two lines listing RAM prices, there are two ideographs (2 bytes
each in GB2312), 三星, which are the brand name Samsung.
04.三星Rambus 512M/256M/128M/ PC133--500/220/120元
05.三星 128M/256M512M DDR266--180/390/680元
Since the spammer omitted the space between 三星 and Rambus, the lexer
identifies these two tokens:
get_token: 1 '三星rambus`
get_token: 1 '三星`
"rambus" is never identified as a token by itself. Perhaps this isn't a
bad thing.
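One way a lexer could split such runs, sketched here with a hypothetical `split_tokens` helper (not bogofilter's actual flex lexer), is to break tokens at the boundary between Han ideographs and Latin letters, so that "三星Rambus" yields both "三星" and "rambus":

```python
import re

# Sketch only: split a mixed CJK/Latin run at script boundaries.
# HAN covers the basic CJK Unified Ideographs block; LATIN is an
# ASCII-letter-initial word, as bogofilter lowercases its tokens.
HAN = r'[\u4e00-\u9fff]+'
LATIN = r'[A-Za-z][A-Za-z0-9]*'

def split_tokens(text):
    """Return lowercased tokens, splitting where the script changes."""
    return [t.lower() for t in re.findall(f'{HAN}|{LATIN}', text)]

# split_tokens('三星Rambus') -> ['三星', 'rambus']
```

With this approach "rambus" would be counted as a token of its own, for better or worse.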
Next, some ABIT (升技) motherboards.
04.升技 AT7/BE7/BD7/KD7/IT7-680/490/360/450/740元
05.升技 SG-71/SR7-8X--330/410元
get_token: 1 '升技`
get_token: 1 '升技`
Because of the space, these are picked up nicely.
There is no space between 三洋 ("Sanyo") and these mobile phone models:
三洋SCP510/550/600-2300/2600/1200元
get_token: 1 '三洋scp510`
results in "(Sanyo)scp510" being identified as a token.
This next example is more interesting.
敬礼!
" ", which is repeated at the beginning and end, is a wide whitespace.
"敬礼" is two ideographs meaning something like "respectful courtesy".
"!" is a wide exclamation mark.
get_token: 1 ' 敬礼! `
Alas, we see them being treated all together as one token.
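A pre-lexing normalization pass could avoid this; as a minimal sketch, Unicode NFKC compatibility normalization maps the ideographic space (U+3000) to a plain space and the fullwidth "！" (U+FF01) to "!", after which an ASCII-whitespace-based lexer splits the token correctly. (Bogofilter does not do this; the helper name is mine.)

```python
import unicodedata

def prenormalize(text):
    """Map fullwidth punctuation and spaces to ASCII equivalents
    before lexing, via NFKC compatibility normalization."""
    return unicodedata.normalize('NFKC', text)

# prenormalize('\u3000敬礼\uff01\u3000') -> ' 敬礼! '
```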
Finally, the last line features wide whitespace, a wide comma, a wide
exclamation mark, two wide parentheses, and a number of ideographs,
roughly "Our company frequently rotates its products to meet our
customers' needs! (Guangzhou Kangtai Technology Co., Ltd.)".
我公司会经常更换产品,以备广大客户需求!(广州康泰科技有限公司)
get_token: 1 ' `
The only token identified is pure whitespace.
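Treating every Unicode punctuation (category P*) or separator (category Z*) character as a delimiter would handle the wide comma, parentheses, and exclamation mark uniformly. A minimal sketch, again hypothetical rather than bogofilter's implementation:

```python
import unicodedata

def tokenize(text):
    """Split on any Unicode punctuation (P*) or separator (Z*)
    character, keeping the runs in between as tokens."""
    out, cur = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ('P', 'Z'):
            if cur:
                out.append(''.join(cur))
                cur = []
        else:
            cur.append(ch)
    if cur:
        out.append(''.join(cur))
    return out
```

On the line above, this would yield three tokens: the sentence body, the clause after the wide comma, and the company name inside the wide parentheses, instead of a single whitespace token.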