Preprocessor for Bogofilter

Michal Kosek michauisbogofiltered at nowa-huta.krakow.pl
Sat Jan 4 22:02:44 CET 2003


On Fri, 3 Jan 2003, Allyn Fratkin wrote:

> > bogoprep decodes both content and subjects (unbase64 decodes only
> > subjects).
>
> you've got that backwards, unbase64 decodes content but not subjects.

Sorry, once again my mistake :)

> your other features are very interesting, especially the html comment
> rearranging.  i believe the next release of bogofilter will decode
> base64 and quoted-printable natively, leaving unbase64 and those features
> of your script unnecessary.  but perhaps it would be worthwhile
> to include features like comment rearranging in a future version
> of bogofilter.

Will it be possible to turn off base64/qp decoding? There is a small
problem: the content must be decoded before the rearranging takes
place. My script leaves the headers unchanged, so its already-decoded
output may confuse bogofilter.
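
To illustrate why the order matters, here is a minimal Python sketch.
It is not the actual bogoprep code: rearrange_comments and
preprocess_base64_part are hypothetical names, and "rearranging" is
assumed here to mean moving HTML comments to the end of the text so
that words split by a comment are joined again.

```python
import base64
import re

# Matches one HTML comment, including comments that span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def rearrange_comments(html_text):
    """Move every HTML comment to the end so that split words rejoin."""
    comments = COMMENT.findall(html_text)
    body = COMMENT.sub("", html_text)
    return body + "".join(comments)

def preprocess_base64_part(encoded_body):
    """Decode a base64 body first, then rearrange its comments."""
    # latin-1 maps every byte to a character, so decoding cannot fail.
    decoded = base64.b64decode(encoded_body).decode("latin-1")
    return rearrange_comments(decoded)

raw = "vi<!-- x -->agra"
encoded = base64.b64encode(raw.encode("latin-1")).decode("ascii")

# On the encoded text there is no "<!--" to find, so nothing changes.
unchanged = rearrange_comments(encoded)
# After decoding, the comment can be moved out of the word.
cleaned = preprocess_base64_part(encoded)
```

Running the rearranging step on the still-encoded body leaves it
untouched, because the base64 alphabet contains no "<" character. That
is the reason decoding has to come first.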

What about charset conversion? Do you plan to convert everything to
UTF-8? It would be nice, but it may make the dictionaries grow a lot.
For example, a lot of people receive tons of spam from Korea. At the
moment bogofilter does not recognize words in Korean text, and that is
fine, because the charset information alone is enough to classify the
mail as spam. The situation is different when somebody has friends who
write in Korean. In that case bogofilter should add every Korean word
to the dictionary to help classify the message. That's why I think
such a conversion would be good, but it should be optional.
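
A rough Python sketch of what "optional" could look like. The function
to_utf8 and its convert switch are made up for illustration and say
nothing about how bogofilter will actually do it.

```python
def to_utf8(body, charset, convert=True):
    """Re-encode a message body from its declared charset to UTF-8.

    body is a bytes object, charset the name taken from the message
    headers. With convert=False the body is returned untouched, which
    corresponds to switching the conversion off.
    """
    if not convert:
        return body
    try:
        return body.decode(charset).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or bytes that do not fit it: keep the
        # original bytes rather than guess.
        return body

korean = "\ud55c\uad6d\uc5b4"            # a short Korean word
euc_kr_bytes = korean.encode("euc_kr")   # as it might arrive in a mail

converted = to_utf8(euc_kr_bytes, "euc-kr")
untouched = to_utf8(euc_kr_bytes, "euc-kr", convert=False)
```

With the switch off, a user who only gets Korean spam keeps relying on
the charset header. With it on, the words themselves become tokens.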

Before I wrote bogoprep I tried to think like a spammer and find as
many ways as possible to hide words typical of spam. Besides decoding
base64 and qp, there should be a way to convert HTML &#number;
references to the appropriate Unicode characters. Do you know any
other methods of hiding such words?
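
As an example of the &#number; trick, here is a small Python sketch
that turns numeric references back into characters.
decode_numeric_refs is a hypothetical helper, not part of bogoprep or
bogofilter, and it also accepts the hexadecimal &#xNN; form, which is
an addition of mine.

```python
import re

# Decimal (&#86;) or hexadecimal (&#x56;) numeric character reference.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the text as it was.
            return match.group(0)
    return NUMERIC_REF.sub(replace, text)

hidden = "&#86;&#73;&#65;&#71;&#82;&#65;"
revealed = decode_numeric_refs(hidden)
```

Python's standard html.unescape does the same job and handles named
entities such as &amp; as well, so the regex above is only there to
show the idea.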

-- 
michau@
"Do you think," said a Woodpecker who had been busy making a hole in the
 table, "that there might be a problem with the name `UNIX?' I mean, it
 does sort of suggest being less than a man."   [ "Alice in UNIX Land" ]





More information about the bogofilter-dev mailing list