FAQ: Asian spam

Clint Adams schizo at debian.org
Thu Mar 27 04:48:37 CET 2003


> Do any of you know whether or not bogofilter's processing is correct for 
> the Chinese, Japanese, or Korean charsets?  Or is the parsing totally 
> bogus, but sufficiently repeatable to produce usable spam/ham 
> classifications?

It's definitely suboptimal.  Attached is the body of a spam I received
in the GB2312 charset.  (The body of this message should be UTF-8.)
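For context, GB2312 is a double-byte encoding: every ideograph in the examples below occupies two bytes, so a byte-oriented lexer is really walking over byte pairs unless the text is decoded first. A quick illustration in Python (just a demonstration of the encoding, not bogofilter code):

```python
# GB2312 encodes each Han ideograph as two bytes; ASCII stays one byte.
# Decoding before tokenizing is what makes per-character classification
# (ideograph vs. Latin vs. punctuation) possible at all.
brand = "三星"                 # two ideographs ("Samsung")
raw = brand.encode("gb2312")

print(len(raw))                # 4 -- two bytes per ideograph
print(raw.decode("gb2312"))    # round-trips back to 三星
```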

In these two lines listing RAM prices, there are two ideographs (2 bytes
each in GB2312) forming the brand name 三星 ("Samsung").

04.三星Rambus 512M/256M/128M/ PC133--500/220/120元
05.三星 128M/256M512M DDR266--180/390/680元

Since the spammer neglected a space between 三星 and Rambus, the lexer
identifies these two tokens:

get_token: 1 '三星rambus`
get_token: 1 '三星`

"rambus" is never identified as a token by itself.  Perhaps this isn't a
bad thing.
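One way to fix this case is to break tokens at script boundaries, so a run of ideographs and a run of Latin alphanumerics never fuse. A minimal sketch in Python (hypothetical; bogofilter's real lexer is a flex scanner, and this is only the idea):

```python
import re

# Tokenize by script: runs of Han ideographs (U+4E00..U+9FFF) and runs
# of Latin alphanumerics are separate tokens; everything else delimits.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]+|[a-z0-9]+", re.IGNORECASE)

def tokens(line):
    return TOKEN_RE.findall(line.lower())

print(tokens("三星Rambus 512M/256M/128M/ PC133--500/220/120元"))
# 三星 and rambus now come out as two separate tokens, space or no space
```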

Next, some ABIT motherboards.

04.升技 AT7/BE7/BD7/KD7/IT7-680/490/360/450/740元
05.升技 SG-71/SR7-8X--330/410元

get_token: 1 '升技`
get_token: 1 '升技`

Because of the space, these are picked up nicely.

There is no space between 三洋 ("Sanyo") and these mobile phone model
numbers:

三洋SCP510/550/600-2300/2600/1200元

get_token: 1 '三洋scp510`

so "三洋scp510" ("Sanyo" fused with the first model number) is
identified as a single token.

This next example is more interesting.

  敬礼!   

"　" (U+3000), which is repeated at the beginning and end, is the
fullwidth ideographic space.

"敬礼" is two ideographs meaning roughly "salute"; it is the
conventional respectful closing of a Chinese letter.

"！" (U+FF01) is the fullwidth exclamation mark.

get_token: 1 '  敬礼!   `

Alas, we see them being treated all together as one token.
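Treating fullwidth whitespace and punctuation as token boundaries would fix this one. A sketch using Python's Unicode character categories (again hypothetical, not bogofilter's implementation):

```python
import unicodedata

def tokens(text):
    """Split on any separator (Z*) or punctuation (P*) character, which
    covers the ideographic space U+3000 and the fullwidth ！ U+FF01 as
    well as their ASCII counterparts."""
    out, cur = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("Z", "P") or ch.isspace():
            if cur:
                out.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    if cur:
        out.append("".join(cur))
    return out

print(tokens("\u3000\u3000敬礼\uff01\u3000\u3000"))  # → ['敬礼']
```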

Finally, the last line features fullwidth whitespace, a fullwidth comma,
a fullwidth exclamation mark, a pair of fullwidth parentheses, and a run
of ideographs meaning roughly "Our company regularly changes its product
line to meet the needs of our many customers! (Guangzhou Kangtai
Technology Co., Ltd.)".

    我公司会经常更换产品，以备广大客户需求！（广州康泰科技有限公司）  

get_token: 1 '   `

The only token identified is pure whitespace.
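With the same two fixes (decode the charset, then delimit on anything that is not an ideograph or alphanumeric), this line would yield its real content words instead of a whitespace token. A rough sketch (hypothetical; the fullwidth spaces, ，, ！, and （…） all drop out as delimiters):

```python
import re

# Anything that is not a Han ideograph or ASCII alphanumeric delimits,
# so fullwidth whitespace and punctuation never end up inside tokens.
line = ("\u3000我公司会经常更换产品\uff0c以备广大客户需求\uff01"
        "\uff08广州康泰科技有限公司\uff09\u3000")
print([t for t in re.split(r"[^\u4e00-\u9fffA-Za-z0-9]+", line) if t])
# → ['我公司会经常更换产品', '以备广大客户需求', '广州康泰科技有限公司']
```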




More information about the Bogofilter mailing list