Preprocessor for Bogofilter

Michal Kosek michauisbogofiltered at nowa-huta.krakow.pl
Fri Jan 3 22:28:53 CET 2003


Hello everyone,

I have written a small utility (attached) that, I think, could be put
into bogofilter's contrib directory.

bogoprep.pl is a preprocessor that tries to extract as much
information from a message as possible (by decoding qp and base64
content and subjects while preserving the original encoding
information). On the other hand, it also keeps wordlists from growing
too much (by changing the order of some html tags and unifying
encoding).

QP and BASE64

bogoprep decodes both content and subjects (unbase64 decodes only
subjects). It also adds some dummy words (such as "hdrbase64encoded"),
since the fact that a subject was encoded in some way may help to
classify the message. The base64 decoding subroutine I wrote does not
decode line by line, but gathers groups of four base64 characters and
then decodes them, so even e.g. d GV zdA == is decoded properly (one day
spammers may send spam that is base64 encoded with fewer than 4
characters per line to make decoding harder).
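
The grouping idea can be illustrated with a minimal sketch. It is written
in Python rather than Perl and is not the code from bogoprep.pl; the
function name decode_base64_loose is hypothetical, and the choice to skip
every character outside the base64 alphabet is an assumption.

```python
import base64
import binascii

# Characters that may appear in base64 data, including the padding sign.
B64_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)


def decode_base64_loose(text):
    """Decode base64 regardless of how it is split across lines.

    Hypothetical helper: whitespace and other non-base64 characters are
    dropped, then the remaining characters are decoded four at a time.
    """
    chars = [c for c in text if c in B64_ALPHABET]
    out = bytearray()
    # Only complete groups of four characters are decoded.
    for i in range(0, len(chars) - len(chars) % 4, 4):
        group = "".join(chars[i:i + 4])
        try:
            out += base64.b64decode(group)
        except binascii.Error:
            break  # malformed group: stop and keep what was decoded
        if "=" in group:
            break  # padding marks the end of the data
    return bytes(out)


print(decode_base64_loose("d GV zdA =="))
```

With the example from the paragraph above, the four fragments are joined
into the groups "dGVz" and "dA==" before decoding, so line breaks or
spaces inside the data do not matter.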

Charsets

Currently, unification of encoding is done only for the most popular
Polish encodings (i.e. cp-1250 and utf-8 encoded Polish letters may be
converted to the most widely used iso-8859-2 encoding), but the
conversion table can be easily changed (everything is in the %conv hash).
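
As an illustration of a table-driven conversion, here is a minimal Python
sketch covering only the cp-1250 case. It is not the %conv hash from
bogoprep.pl: the names CONV and cp1250_to_latin2 are hypothetical, and the
table lists only the six Polish letters whose byte values differ between
cp-1250 and iso-8859-2.

```python
# Polish letters whose byte values differ between the two encodings:
# cp-1250 byte -> iso-8859-2 byte. The remaining Polish letters share the
# same byte value in both encodings and need no entry.
CONV = {
    0xA5: 0xA1,  # A with ogonek (capital)
    0xB9: 0xB1,  # a with ogonek
    0x8C: 0xA6,  # S with acute (capital)
    0x9C: 0xB6,  # s with acute
    0x8F: 0xAC,  # Z with acute (capital)
    0x9F: 0xBC,  # z with acute
}

# Build a 256-entry translation table once; unlisted bytes map to themselves.
TABLE = bytes.maketrans(bytes(CONV.keys()), bytes(CONV.values()))


def cp1250_to_latin2(raw):
    """Rewrite cp-1250 Polish letters as iso-8859-2 (hypothetical helper)."""
    return raw.translate(TABLE)


sample = "zażółć gęślą jaźń".encode("cp1250")
print(cp1250_to_latin2(sample) == "zażółć gęślą jaźń".encode("iso-8859-2"))
```

A blind byte translation like this should only be applied to text known
to be cp-1250, since the same byte values mean other characters in
iso-8859-2.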

Order of tags

Spammers use some techniques to hide words typical of spam. They put
nonexistent html tags or comments into words, e.g. FR<bla>EE. bogoprep
reorders them so that it becomes "FREE <bla>", and bogofilter can catch
these words. The tag recognition may not be fully compatible with the
HTML standard, but I tried to make it as compatible as possible with
the behaviour of IE (and OE).
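
The reordering can be sketched in a few lines of Python. This is a
simplified illustration, not the logic of bogoprep.pl: the name move_tags
is hypothetical, and treating anything between "<" and ">" as a tag is an
assumption that is much cruder than what a browser does.

```python
import re

# Anything between "<" and ">" counts as a tag here (assumption).
TAG = re.compile(r"<[^<>]*>")
# A word that is interrupted by one or more runs of tags.
SPLIT_WORD = re.compile(r"\w+(?:(?:<[^<>]*>)+\w+)+")


def move_tags(text):
    """Join words broken up by tags and move the tags behind the word."""

    def rejoin(match):
        chunk = match.group(0)
        tags = "".join(TAG.findall(chunk))  # tags in their original order
        word = TAG.sub("", chunk)           # the word with tags removed
        return word + " " + tags

    return SPLIT_WORD.sub(rejoin, text)


print(move_tags("Get it FR<bla>EE now"))
```

Tags that sit at the start or end of a word are left in place; only tags
with word characters on both sides are moved.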

What do you think about this script? Any comments would be helpful.

-- 
michau@
Oh no I've set too much / I haven't set enough
I thought that I straced you sleeping / I thought that I straced you run
I think I thought I saw core dumped     [ R.A.M., "Loosing my revision" ]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogoprep.pl.gz
Type: application/octet-stream
Size: 2575 bytes
Desc: 
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20030103/83a26f42/attachment.obj>

