charset implementation progress

David Relson relson at osagesoftware.com
Wed Nov 27 02:34:27 CET 2002


At 08:05 PM 11/26/02, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > My spamlist has approx 20,000 Korean words in it, none of which are
> > readable.  One of the ideas that the spambayes folks use is to convert
> > unreadable characters to question marks.  For charset=korean, this could
> > be done to characters above 0x80.  In normal usage, '?' is processed as
> > a word separator by the lexer.  Doing this conversion as a type of "case
> > folding" wouldn't affect the lexer, but would pass a much smaller set
> > of tokens to the spamicity calculation.  Bogofilter quickly trains on
> > tokens like "?????ab??" and will correctly classify the message as
> > spam.

That's one of the nice things about options.  They're optional.  Use them 
only if they apply to your site.
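
For concreteness, here's a rough sketch of that folding pass.  The 
function name is mine, and treating every byte at or above 0x80 as 
unreadable is the simplification described above:

    #include <stddef.h>

    /* Sketch: fold non-ASCII bytes in an already-lexed token to '?'.
     * Done after lexing, like case folding, so the lexer's usual
     * treatment of '?' as a separator doesn't interfere; 20,000
     * distinct Korean tokens collapse into a few hundred patterns
     * like "?????ab??". */
    static void fold_high_bytes(unsigned char *tok, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++)
            if (tok[i] >= 0x80)
                tok[i] = '?';
    }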

>That's the American way of doing things. Don't take this personally, but
>my experience tells me that many Americans simply are not aware of the
>needs of languages beyond US-ASCII. :-( Consider yourself lucky to get
>along with 94 printable characters.

Yes it is.  I know I'm ignorant in many areas - and language outside 
US-ASCII is one of them.  We don't have the same cultural experience of 
living in a multi-lingual, multi-cultural, multi-alphabet environment 
that Europeans do.

>How does bogofilter know what character sets the user can read? How are
>you telling which UTF-8 subset I can read? And why should we go to this
>length at all? Let's use some existing library to canonicalize our stuff
>to Unicode, register everything in Unicode and be done with it. The
>user's teaching bogofilter will work out in the end.

Got any libraries in mind?
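
If we go that route, iconv(3) is the obvious POSIX candidate.  A minimal 
sketch of the canonicalization step, with buffer handling elided (note 
that some platforms declare iconv's second argument as const char **):

    #include <iconv.h>
    #include <stddef.h>

    /* Sketch: convert a buffer from its declared charset to UTF-8.
     * Returns 0 on success, -1 on failure (unknown charset or bad
     * input).  Output buffer sizing is elided for brevity. */
    static int to_utf8(const char *charset, char *in, size_t inlen,
                       char *out, size_t outlen)
    {
        iconv_t cd = iconv_open("UTF-8", charset);
        size_t rc;

        if (cd == (iconv_t)-1)
            return -1;      /* iconv doesn't know this charset */
        rc = iconv(cd, &in, &inlen, &out, &outlen);
        iconv_close(cd);
        return rc == (size_t)-1 ? -1 : 0;
    }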

> > Remember my mention of 20,000 Korean tokens?  With the question mark
> > technique, the number drops to 500 or so.  This is a big win for saving
> > space.
>
>OTOH, the character set alone when emitted as a token is going to be
>quite indicative for far-east character sets.

True.  Unfortunately, all tokens are treated the same way when the 
spamicity is calculated.  It might be _interesting_ to have a charset 
priority, i.e. "charset=XYZ" means "spam" (and skip the rest of the 
computation).  Of course, that idea already exists in the world.  It's 
known as a blacklist.
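
For what it's worth, the short-circuit itself would be cheap.  A hedged 
sketch; the list contents and the early-exit policy are illustrative 
only, not anything bogofilter does today:

    #include <string.h>
    #include <strings.h>

    /* Hypothetical: charsets whose presence alone should classify a
     * message as spam.  The names here are examples, not a policy. */
    static const char *spam_charsets[] = { "euc-kr", "ks_c_5601-1987", NULL };

    /* Returns 1 if a lexer token of the form "charset=..." names a
     * blacklisted charset, so the caller can skip the spamicity
     * computation entirely. */
    static int charset_blacklisted(const char *token)
    {
        const char **cs;

        if (strncmp(token, "charset=", 8) != 0)
            return 0;
        for (cs = spam_charsets; *cs != NULL; cs++)
            if (strcasecmp(token + 8, *cs) == 0)
                return 1;
        return 0;
    }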

> > Also, names sometimes use dashes and
> > sometimes use underscores.  Summary: "charset=euc-kr" may appear
> > multiple times in the wordlist.
>
>Never mind. The IANA list is there, and if spammers deviate from that,
>the display of their message is suboptimal.

As I don't want to even see their messages, it's worth my while to be 
proactive and catch them even if the charset name is slightly garbled.

> > And the last thing to mention is that the Robinson calculation uses all
> > tokens in the message (with spamicities outside of min_dev), so all the
> > Korean symbols contribute to the score.  This makes any individual token
> > less significant, even one as obvious as "euc-kr".
>
>Which, I believe, is the key why a Graham -> Robinson switch will give
>some people lots of false negatives.

I made that switch several weeks ago and haven't bothered to rebuild the 
database.  bogofilter continues to do a good job.  I see about 10% false 
negatives currently and -no- false positives.

A few days ago, I switched to using "min_dev=0.1", which ignores tokens 
whose spamicity falls between 0.4 and 0.6 when computing the overall 
score.  With robx=0.415, this excludes previously unseen words from the 
computation.  bogofilter continues to work well.
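
For anyone following along, here's a rough sketch of that cutoff as I 
understand it; the names are mine, not bogofilter's actual source:

    #include <math.h>

    #define EVEN_ODDS 0.5

    /* Sketch of the min_dev cutoff: a token contributes to the
     * Robinson score only if its spamicity deviates from 0.5 by at
     * least min_dev.  With min_dev = 0.1, anything in (0.4, 0.6) is
     * skipped; unseen tokens default to robx = 0.415, so they fall
     * inside the band and are skipped as well. */
    static int token_counts(double spamicity, double min_dev)
    {
        return fabs(spamicity - EVEN_ODDS) >= min_dev;
    }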




