charset implementation progress
relson at osagesoftware.com
Tue Nov 26 20:34:27 EST 2002
At 08:05 PM 11/26/02, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
> > My spamlist has approx 20,000 Korean words in it, none of which are
> > readable. One of the ideas that the spambayes folks use is to convert
> > unreadable characters to question marks. For charset=korean, this could
> > be done to characters above 0x80. In normal usage, '?' is processed as
> > a word separator by the lexer. Doing this conversion as a type of "case
> > folding" wouldn't affect the lexer, but would pass a much smaller set
> > of tokens to the spamicity calculation. Bogofilter quickly trains on
> > tokens like "?????ab??" and will correctly classify the message as
> > spam.
That's one of the nice things about options: they're optional. Use them
only if they apply to your site.
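For concreteness, here's a minimal sketch of the folding step described
above (not bogofilter's actual code; the function name and the per-token
call site are my assumptions):

    #include <stddef.h>

    /* Sketch of the spambayes-style folding described above: within an
     * already-lexed token, replace every byte above 0x7F with '?', so
     * runs of unreadable Korean bytes collapse into a small set of
     * tokens like "?????ab??". Applied after lexing, like case folding,
     * it doesn't disturb the lexer's word-separator handling. */
    void fold_unreadable(unsigned char *token, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++) {
            if (token[i] > 0x7F)
                token[i] = '?';
        }
    }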
>That's the American way of doing things. Don't take this personally, but
>my experience tells me that many Americans simply are not aware of the
>needs of languages beyond US-ASCII. :-( Consider yourself lucky to get
>along with 94 printable characters.
Yes, it is. I know I'm ignorant in many areas, and language outside
US-ASCII is one of them. We don't have the same cultural experience of
living in a multi-lingual, multi-cultural, multi-alphabet environment as
Europeans do.
>How does bogofilter know what character sets the user can read? How are
>you telling which UTF-8 subset I can read? And why should we go to such
>lengths at all? Let's use some existing library to canonicalize our stuff
>to Unicode, register everything in Unicode, and be done with it. The
>user's training of bogofilter will work out in the end.
Got any libraries in mind?
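One obvious candidate would be iconv(3), which glibc already ships. A
hedged sketch of converting a buffer from a message's declared charset
to UTF-8 (buffer handling simplified; a real version would loop to
handle E2BIG and partial conversions):

    #include <stddef.h>
    #include <iconv.h>

    /* Convert a buffer from the message's declared charset to UTF-8.
     * Returns 0 on success, -1 on error. */
    int to_utf8(const char *charset, char *in, size_t inleft,
                char *out, size_t outleft)
    {
        iconv_t cd = iconv_open("UTF-8", charset);
        size_t rc;

        if (cd == (iconv_t)-1)
            return -1;
        rc = iconv(cd, &in, &inleft, &out, &outleft);
        iconv_close(cd);
        return rc == (size_t)-1 ? -1 : 0;
    }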
> > Remember my mention of 20,000 Korean tokens? With the question mark
> > technique, the number drops to 500 or so. This is a big win for
> > saving space.
>OTOH, the character set alone, when emitted as a token, is going to be
>quite indicative for far-east character sets.
True. Unfortunately, all tokens are treated the same way when the
spamicity is calculated. It might be _interesting_ to have a charset
priority, i.e. "charset=XYZ" means "spam" (and skip the rest of the
computation). Of course that idea already exists in the world: it's known
as a blacklist.
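A hypothetical sketch of that short circuit (the list contents and the
function name are made up, and bogofilter has no such option today):

    #include <strings.h>

    /* Per-site list of charsets that mean "spam" on sight; checking it
     * before scoring would skip the Robinson computation entirely. */
    static const char *spam_charsets[] = { "euc-kr", "ks_c_5601-1987",
                                           NULL };

    int charset_means_spam(const char *charset)
    {
        const char **p;

        for (p = spam_charsets; *p != NULL; p++) {
            if (strcasecmp(charset, *p) == 0)
                return 1;
        }
        return 0;
    }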
> > Also, names sometimes use dashes and
> > sometimes use underscores. Summary: "charset=euc-kr" may appear
> > multiple times in the wordlist.
>Never mind. The IANA list is there, and if spammers deviate from it,
>the display of their message is suboptimal.
As I don't want to even see their messages, it's worth my while to be
proactive and catch their messages even if the charset name is slightly
garbled.
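One way to keep the variants from multiplying, assuming we normalize the
token before it reaches the wordlist (a sketch, not existing code):

    #include <ctype.h>

    /* Fold charset-name variants together: lowercase everything and
     * map '_' to '-', so "EUC_KR" and "euc-kr" become one token. */
    void normalize_charset_name(char *name)
    {
        for (; *name != '\0'; name++) {
            *name = (char)tolower((unsigned char)*name);
            if (*name == '_')
                *name = '-';
        }
    }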
> > And the last thing to mention is that the Robinson calculation uses
> > all tokens in the message (with spamicities outside of min_dev), so
> > all the Korean symbols contribute to the score. This makes any
> > individual token less significant, even one as obvious as "euc-kr".
>Which, I believe, is the key why a Graham -> Robinson switch will give
>some people lots of false negatives.
I made that switch several weeks ago and haven't bothered to rebuild the
database. bogofilter continues to do a good job. I see about 10% false
negatives currently and -no- false positives.
A few days ago, I switched to using "min_dev=0.1", which ignores values
between 0.4 and 0.6 when computing the overall score. With robx=0.415,
this excludes previously unseen words from the computation. bogofilter
continues to work well.
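For clarity, the cutoff works out like this (a sketch of the rule as
described above, not bogofilter's actual source):

    #include <math.h>

    #define MIN_DEV 0.1    /* ignore spamicities within 0.1 of 0.5 */
    #define ROBX    0.415  /* spamicity assigned to unseen tokens  */

    /* A token contributes to the Robinson score only if its spamicity
     * lies outside (0.5 - MIN_DEV, 0.5 + MIN_DEV), i.e. outside
     * (0.4, 0.6). Unseen tokens sit at ROBX = 0.415, inside that
     * window, so they are excluded automatically. */
    int token_contributes(double spamicity)
    {
        return fabs(spamicity - 0.5) > MIN_DEV;
    }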