Preprocessor for Bogofilter

Sat Jan 4 22:32:32 CET 2003

At 04:02 PM 1/4/03, Michal Kosek wrote:

> > your other features are very interesting, especially the html comment
> > rearranging.  i believe the next release of bogofilter will decode
> > base64 and quoted-printable natively, leaving unbase64 and those features
> > of your script unnecessary.  but perhaps it would be worthwhile
> > to include features like comment rearranging in a future version
> > of bogofilter.
>
>Will it be possible to turn off base64/qp decoding? There is a small
>problem, because content must be decoded before rearranging takes
>place. My script leaves the headers, so output may confuse bogofilter.

Michal,

I haven't made the time to look at your script, so if I ask dumb questions,
please pardon my ignorance.  The CVS version of bogofilter currently does
MIME processing, which includes decoding of base64, qp, and uuencoded
text.  It also has a new lexer for HTML.  With this built into bogofilter,
it would seem that your script is not needed.  Is that analysis correct?
If not, can you explain the benefits of using your script in addition to
bogofilter's builtin capabilities?
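
To make "decoding" concrete, here is a minimal Python sketch of undoing a
Content-Transfer-Encoding before tokenizing.  It is an illustration only,
not bogofilter's code; the function name decode_body is made up for the
example, and uuencoded text is left out to keep it short.

```python
import base64
import quopri


def decode_body(payload: bytes, transfer_encoding: str) -> bytes:
    """Return the body with its Content-Transfer-Encoding undone.

    decode_body is a hypothetical helper for illustration; it is not
    part of bogofilter.
    """
    encoding = transfer_encoding.strip().lower()
    if encoding == "base64":
        return base64.b64decode(payload)
    if encoding == "quoted-printable":
        return quopri.decodestring(payload)
    # 7bit, 8bit, binary, or anything unrecognized: pass through unchanged.
    return payload
```

Once the body is decoded, a word hidden as "fr=65e" or inside a base64
block is visible to the lexer as ordinary text.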

>What about charset conversion? Do you plan to convert everything to
>utf8? It would be nice... But it may make dictionaries grow a great
>deal. For example, a lot of people receive tons of spam from Korea. Now
>bogofilter does not recognize words in Korean texts. And it is good,
>because charset information is enough to classify mail as spam.
>Another situation is when somebody has friends who speak Korean. In
>this case bogofilter should add every Korean word to help classify the
>message. That's why I think that such conversion would be good, but it
>should be optional.

Conversion to UTF-8 has been suggested before and will likely be added at
some point.  I personally don't need it, as my email is all in English,
which has a restricted character set.  Making the conversion optional is
probably the right way to do it.

I deal with Asian-language messages, which I presume to be spam, by using a
procmailrc which moves them all into a "spam-unreadable" file.  For those
who can make good use of it, bogofilter has a "replace_nonascii_characters"
option that substitutes a '?' for non-ASCII characters.  This effectively
reduces Korean tokens to sequences of question marks and keeps them from
filling up the wordlists.
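
The effect of that substitution can be sketched in a few lines of Python.
This only illustrates the idea and is not how bogofilter implements the
option; the name replace_nonascii is an assumption made for the example.

```python
def replace_nonascii(data: bytes, substitute: bytes = b"?") -> bytes:
    """Replace every byte outside the 7-bit ASCII range with `substitute`.

    Hypothetical helper showing the effect described above; it is not
    bogofilter's implementation.
    """
    if len(substitute) != 1:
        raise ValueError("substitute must be exactly one byte")
    return bytes(b if b < 0x80 else substitute[0] for b in data)
```

For example, a word made of four high-bit bytes (as in a two-character
EUC-KR word) followed by " hello" comes out as "???? hello", so every such
word collapses into a handful of question-mark tokens.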

>Before I wrote bogoprep I tried to think like a spammer and find as
>many ways as possible to hide words typical of spam. Besides decoding
>base64 and qp there should be a way to convert html &#number; to the
>appropriate Unicode character. Do you know any other methods of hiding
>such words?

Conversion of &#number; to characters can be added if it becomes
necessary.  Presently bogofilter ignores numbers and many special
characters, so it would ignore these references.  Spam could sneak past
bogofilter by containing "nice" words and having the "spammish" words
disguised in this manner.  However, as the user trains bogofilter with such
disguised messages, bogofilter will learn to call them spam.  That's the
beauty of training a spam filter like bogofilter.
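
For reference, converting decimal &#number; references to characters is a
small job.  The Python sketch below shows one way it could be done; it is
not something bogofilter does today, and the name decode_numeric_refs is
invented for the example.

```python
import re

# Matches decimal numeric character references such as "&#102;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(text: str) -> str:
    """Replace each decimal &#number; reference with its Unicode character.

    Hypothetical helper for illustration only.  References whose number
    is not a valid code point are left untouched.
    """
    def _replace(match: "re.Match[str]") -> str:
        try:
            return chr(int(match.group(1)))
        except (ValueError, OverflowError):
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)
```

With this, "&#102;&#114;&#101;&#101;" becomes "free".  Python's standard
html.unescape function also handles hexadecimal and named references, for
anyone who wants the fuller treatment.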

David