Preprocessor for Bogofilter

Wed Jan 8 17:38:08 CET 2003

Michal Kosek <michauisbogofiltered at nowa-huta.krakow.pl> writes:

> What about charset conversion? Do you plan to convert everything to
> UTF-8? It would be nice...

Yup, we have MIME code that has just become usable and needs to
stabilize a bit; it should be able to pull the charset out of the
message headers and parse it.
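
As a rough illustration of the idea (not bogofilter's actual MIME code,
which is written in C), here is a minimal Python sketch that pulls the
charset out of a message and converts the first text part to UTF-8.
The function name body_as_utf8 and the fallbacks to us-ascii and
latin-1 are assumptions made for this example only.

```python
from email import message_from_bytes


def body_as_utf8(raw: bytes) -> bytes:
    """Return the first text part of a raw message, re-encoded as UTF-8."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        # Assumption: treat a missing charset parameter as us-ascii
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Charset name unknown to Python; latin-1 never fails to decode
            text = payload.decode("latin-1")
        return text.encode("utf-8")
    return b""
```

With something like this in front of the tokenizer, every word reaches
the word list in a single canonical encoding.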

> But it may make the dictionaries grow a lot. For example, a lot of
> people receive tons of spam from Korea. Right now bogofilter does not
> recognize words in Korean texts, and that is good, because the charset
> information is enough to classify the mail as spam. It is different
> when somebody has friends who write in Korean. In this case bogofilter
> should add every Korean word to help classify the message. That's why
> I think that such conversion would be good, but it should be optional.

Even if the conversion is present, you can still use maildrop or a
similar delivery filter to kill off Korean stuff before feeding the
mail to bogofilter.

The conversion will have to be global for the system (a property of
the database) though, because everything would have to be canonicalized
to the same charset.

> Before I wrote bogoprep I tried to think like a spammer and find as
> many ways as possible to hide words typical of spam. Besides decoding
> base64 and qp there should be a way to convert HTML &#number; to the
> appropriate Unicode character. Do you know any other methods of hiding
> such words?

Oh, that's a good one (the HTML entities). Yes, color tricks have
already been seen: spammers send non-spam text as white-on-white to
fool content filters. They don't fool blacklist filters such as
SpamAssassin, though.
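
To show what decoding the HTML entities could look like, here is a
small Python sketch using the standard library's html.unescape. It is
only an illustration; the function name decode_entities is made up for
this example and is not part of bogoprep or bogofilter.

```python
import html


def decode_entities(text: str) -> str:
    """Replace numeric (&#NNN; / &#xHH;) and named HTML entities with characters."""
    return html.unescape(text)


# A word hidden behind decimal and hexadecimal character references
print(decode_entities("&#102;&#114;&#x65;&#101;"))  # prints "free"
```

Running this step before tokenizing means the hidden word is counted
like any other occurrence of the same word.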

-- 
Matthias Andree