charset implementation progress

Matthias Andree matthias.andree at gmx.de
Wed Nov 27 04:03:04 CET 2002


David Relson <relson at osagesoftware.com> writes:

> Yes it is.  I know I'm ignorant in many areas - and languages outside
> US-ASCII are one of them. We don't have the same cultural experience
> of living in a multi-lingual, multi-cultural, multi-alphabet
> environment that Europeans do.

So listen to your immigrants and tell your immigration office you want
more of them. Or go visit Québec or Mexico. ;-)

Seriously, I don't really speak more than two and a half languages (EN
DE FR), and these all fit into ISO-8859-15. I have hardly any contact
with other cultures, apart from integrated Silesian or Russian
immigrants who have kept some of their culture, particularly in food
or habits.

ISO-8859-1 does not include the French oe ligature (œ) used in sister
(sœur) and heart (cœur) though; ISO-8859-15 does. In fact, all the
languages that I know fragments of (from past vacations or the like)
fit into ISO-8859-15.
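
A minimal sketch of that difference, written in Python purely for
illustration (it relies on Python's built-in codecs and is not
bogofilter code; the helper name fits() is made up for this example):

```python
def fits(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

word = "s\u0153ur"  # "sœur", written with the oe ligature U+0153

latin1_ok = fits(word, "iso-8859-1")   # the ligature has no slot here
latin9_ok = fits(word, "iso-8859-15")  # ISO-8859-15 adds the ligature
```

With these definitions latin1_ok should come out False and latin9_ok
True.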

>>How does bogofilter know what character sets the user can read? How are
>>you telling which UTF-8 sub set I can read? And why should we go this
>>length at all? Let's use some existing library to canonicalize our stuff
>>to Unicode, register everything in Unicode and be done with it. The
>>user's teaching bogofilter will work out in the end.
>
> Got any libraries in mind?

iconv, jconv, recode for now. I haven't looked at their APIs, or at how
comprehensive and well-maintained they are. This is new territory to me
as well.
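
To make the "canonicalize everything to Unicode" idea concrete, here is
a rough sketch. It uses Python's built-in codecs as a stand-in for
whichever conversion library ends up being chosen, so it says nothing
about the iconv or recode APIs. The function name to_unicode() and the
ISO-8859-1 fallback are assumptions made for the example.

```python
import codecs

def to_unicode(raw, declared_charset, fallback="iso-8859-1"):
    """Decode raw message bytes into a Unicode string.

    declared_charset is whatever the MIME header claims. Unknown
    charset names fall back to `fallback` (an assumption), and bytes
    that cannot be decoded are replaced with U+FFFD.
    """
    try:
        codecs.lookup(declared_charset)
        charset = declared_charset
    except LookupError:
        charset = fallback
    return raw.decode(charset, errors="replace")

# The same word arriving in two different encodings...
token_latin9 = to_unicode(b"c\xbdur", "iso-8859-15")
token_utf8 = to_unicode(b"c\xc5\x93ur", "utf-8")
# ...should end up as one and the same Unicode token.
same_token = token_latin9 == token_utf8
```

The point of the design is that the word list only ever sees one
spelling of a token, regardless of the charset the sender picked.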

> True.  Unfortunately, all tokens are treated the same way when the
> spamicity is calculated.  It might be _interesting_ to have a charset
> priority, i.e. "charset=XYZ" means "spam" (and skip the rest of the
> computation).  Of course that idea exists in the world.  It's known as a
> blacklist.

Yup, and maildrop or procmail do that already. So we're back to
weighting some traits over others.
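
For illustration only, a charset blacklist that short-circuits the
spamicity computation could look roughly like the sketch below. It is
Python using the standard email package, not bogofilter's C code; the
blacklist entry "xyz" is the placeholder from the quoted text, and
charset_verdict() is a made-up name.

```python
from email import message_from_bytes

# Hypothetical blacklist; "xyz" is a placeholder, not a real charset.
BLACKLISTED_CHARSETS = {"xyz"}

def charset_verdict(raw_message):
    """Return "spam" if any MIME part declares a blacklisted charset.

    Returns None when no part matches, meaning the caller should go on
    with the normal spamicity computation.
    """
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        charset = part.get_content_charset()  # lower-cased, or None
        if charset in BLACKLISTED_CHARSETS:
            return "spam"
    return None

sample = b'Content-Type: text/plain; charset="XYZ"\r\n\r\nhello\r\n'
verdict = charset_verdict(sample)
```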

>>Never mind. The IANA list is there, and if spammers deviate from that,
>>the display of their message is suboptimal.
>
> As I don't want to even see their messages, it's worth my while to be
> proactive and catch their messages even if slightly garbled.

So you'd go for a similarity match rather than exact match in your
blacklist?
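
A similarity match of that kind might be sketched as follows, using
Python's difflib for the fuzzy comparison. The blacklist entry
"x-example-charset", the 0.8 cutoff, and the function name are all
assumptions chosen for the example, not tuned or recommended values.

```python
import difflib

# Hypothetical blacklist entry, used only to demonstrate the matching.
BLACKLIST = ["x-example-charset"]

def fuzzy_blacklisted(charset, cutoff=0.8):
    """Return True if charset is close enough to a blacklisted name.

    cutoff is the minimum difflib similarity ratio (0.0 to 1.0); the
    default of 0.8 is an arbitrary choice for this sketch.
    """
    matches = difflib.get_close_matches(
        charset.lower(), BLACKLIST, n=1, cutoff=cutoff
    )
    return bool(matches)

exact_hit = fuzzy_blacklisted("X-Example-Charset")
garbled_hit = fuzzy_blacklisted("x-example_charset")  # slightly garbled
unrelated = fuzzy_blacklisted("utf-8")
```

The catch is the usual one for fuzzy matching: the lower the cutoff,
the more legitimate charset names risk being caught as well.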

> I made that switch several weeks ago and haven't bothered to rebuild the
> database.  bogofilter continues to do a good job.  I see about 10% false
> negatives currently and -no- false positives.

I have tons of false positives, usually bounces, and occasionally very
short mails where the header weighs in, for instance when someone from
Asia asks a question (no discrimination intended, it's just an
observation); some of these people seem to send their mail through the
same servers as spammers do.

I believe the problem with bounces is that a bounce contains major
parts of a spam mail, so it gets tagged as spam. Conclusion: don't use
bogofilter to screen abuse mailboxes. ;-)

-- 
Matthias Andree



More information about the bogofilter-dev mailing list